[PDFBOX-586] Text Extraction on Android - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 1.1.0
Fix Version/s: None
Component/s: Text extraction
Labels:
- modularization
Environment:
Windows XP + Eclipse + PDFBox sources

Description

Hi,

I have noticed that I can extract text from some PDF files with PDFBox 0.7.4, but for the same file and the same page, PDFBox 1.1.0 doesn't retrieve any text, or the extraction is worse.

Am I the only one who thinks there is a regression in text extraction?

My code looks like this:

    PDDocument document = PDDocument.load("/sdcard/internals.pdf");
    int numberOfPages = document.getNumberOfPages();
    resources = this.getResources();

    android.util.Log.d(TEST_PDFBOX, "readerPDF() resources : " + resources);  // ANDROID code here to get file
    resourceGlyphList = R.raw.glyphlist;
    InputStream rawResource = resources.openRawResource(R.raw.pdftextstripper);  // PDFBOX property file
    android.util.Log.d(TEST_PDFBOX, "readerPDF() rawResource : " + rawResource);
    Properties properties = new Properties();
    properties.load(rawResource);

    PDFTextStripper stripper = new PDFTextStripper(properties);

    stripper.setStartPage(pageNumber);  // 1 or any other page
    stripper.setEndPage(pageNumber);    // same page as above

    String s = "Page : " + pageNumber + "<br><br>" + stripper.getText(document);
    android.util.Log.d(TEST_PDFBOX, "readerPDF() stripper extract pages text : " + s);

Maybe I should use page.getContents().getStream(), stripper.getTextForRegion("class1"), or stripper.writeText(doc, outputStream) instead?

I want the text as a String, not as a newly created file.
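
On the last point: an API that writes to a java.io.Writer does not have to create a file, because the Writer can be an in-memory StringWriter. The sketch below shows only that general technique using the JDK. StringCapture, TextProducer and capture are hypothetical names made up for illustration; they stand in for a call such as stripper.writeText(doc, outputStream), assuming that method accepts a Writer, and they are not part of PDFBox.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class StringCapture {

    /** Hypothetical stand-in for any API that writes its output to a Writer. */
    public interface TextProducer {
        void writeTo(Writer out) throws IOException;
    }

    /** Runs the producer against an in-memory buffer and returns the collected text. */
    public static String capture(TextProducer producer) {
        StringWriter buffer = new StringWriter();
        try {
            producer.writeTo(buffer);
        } catch (IOException e) {
            // StringWriter itself never fails, but the producer is allowed to.
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // The lambda plays the role of the text extractor (an assumption for this demo).
        String text = capture(out -> {
            out.write("Page : 1\n");
            out.write("extracted text");
        });
        System.out.println(text);
    }
}
```

Nothing is written to disk here; the text only ever lives in the StringWriter's buffer, so the result is available directly as a String.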

Attachments

- ASEB-Camping_Car_ou_Bateau.pdf (86 kB), 18/May/10 15:17, Bernard
- EncryptedFileTest_AES.pdf (44 kB), 05/Mar/11 16:56, Eddie B
- EncryptedFileTest_RC4.pdf (43 kB), 05/Mar/11 16:56, Eddie B
- Eval.pdf (192 kB), 18/May/10 15:17, Bernard
- internals.pdf (43 kB), 18/May/10 17:47, Bernard
- PDFBOX586-ASEB-Camping_Car_ou_Bateau.txt (2 kB), 18/May/10 19:00, Andreas Lehmkühler
- PDFBOX586-Eval.txt (26 kB), 18/May/10 19:00, Andreas Lehmkühler
- PDFBOX586-internals.txt (15 kB), 18/May/10 19:00, Andreas Lehmkühler
- TestPDFBox.zip (5.93 MB), 02/Aug/10 07:40, Bernard