[PDFBOX-1039] Arabic Text Extraction using PDFTextStripper working partially - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 1.5.0
Fix Version/s: None
Component/s: Text extraction
Labels:
- arabic
- textExtraction
Environment:
Windows XP, Java 1.6

Description

I have been trying to extract the contents of PDF file (so as to index it with lucene). The PDF file contains arabic.

Both PDF files contain the exact same information. The strange thing is PDFTextStripper extract data from one file correctly(gives proper arabic) but not from the other(gives complete question marks ???? or [][][][][] )

Below is the code being used

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class TesExtraction {

// Extract text from PDF Document
static String pdftoText(String fileName) {
PDFParser parser;
String parsedText = null;;
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File(fileName);
if (!file.isFile())

{ System.err.println("File " + fileName + " does not exist."); return null; }

try

{ parser = new PDFParser(new FileInputStream(file)); }

catch (IOException e)

{ System.err.println("Unable to open PDF Parser. " + e.getMessage()); return null; }

try

{ parser.parse(); cosDoc = parser.getDocument(); pdfStripper = new PDFTextStripper("CP-1252"); pdDoc = new PDDocument(cosDoc); pdfStripper.setStartPage(1); pdfStripper.setEndPage(5); parsedText = pdfStripper.getText(pdDoc); }

catch (Exception e)

{ System.err .println("An exception occured in parsing the PDF Document." + e.getMessage()); }

finally {
try

{ if (cosDoc != null) cosDoc.close(); if (pdDoc != null) pdDoc.close(); }

catch (Exception e)

{ e.printStackTrace(); }

}
return parsedText;
}
public static void main(String args[])

{ System.out.println(pdftoText("C:\\LuceneTest\\Data\\TestWord.pdf")); System.out.println(pdftoText("C:\\LuceneTest\\Data\\TestPDFCreator.pdf")); }

}

NOTE: Where can I upload the pdf files ?

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

TestPDFCreator.pdf
17/Jun/11 08:04
15 kB
Franklin
TestWord.pdf
17/Jun/11 08:04
62 kB
Franklin

Activity

People

Assignee:: Andreas Lehmkühler

Reporter:: Franklin

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 17/Jun/11 08:02

Updated:: 23/Mar/13 13:05

Resolved:: 18/Jun/11 12:37

Time Tracking

Estimated:

168h

Remaining:

168h

Logged:

Not Specified