Details
- Improvement
- Status: Closed
- Major
- Resolution: Won't Fix
- 1.1.0
- None
- Windows XP + Eclipse + PDFBox sources
Description
Hi,
I have noticed that I can extract text from some PDF files with PDFBox 0.7.4, but for the same file and the same page, PDFBox 1.1.0 either doesn't retrieve any text or produces a worse extraction.
Am I the only one who thinks there is a regression in text extraction?
My code looks like this:
PDDocument document = PDDocument.load("/sdcard/internals.pdf");
int numberOfPages = document.getNumberOfPages();
resources = this.getResources();
android.util.Log.d(TEST_PDFBOX, "readerPDF() resources : "+resources);
// ANDROID code here to get file
resourceGlyphList = R.raw.glyphlist;
InputStream rawResource = resources.openRawResource(R.raw.pdftextstripper); // PDFBOX property file
android.util.Log.d(TEST_PDFBOX, "readerPDF() rawResource : "+rawResource);
Properties properties = new Properties();
properties.load(rawResource);
PDFTextStripper stripper = new PDFTextStripper(properties);
stripper.setStartPage(pageNumber); // 1 or any other page
stripper.setEndPage(pageNumber); // same page as above
String s = "Page : "+pageNumber+"<br><br>"+stripper.getText(document);
android.util.Log.d(TEST_PDFBOX, "readerPDF() stripper extract pages text : "+s);
Maybe I should use page.getContents().getStream(), stripper.getTextForRegion( "class1" ), or stripper.writeText(doc, outputStream) instead?
I want the text as a String, not as a newly created file.
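For what it's worth, a Writer-based call does not have to produce a file: if the method accepts a java.io.Writer, passing a StringWriter keeps everything in memory. The sketch below uses only the JDK. Its writeText method is a hypothetical stand-in for a call such as stripper.writeText(doc, writer), and the class name and sample lines are made up for illustration, so treat it as an outline of the pattern rather than as PDFBox code.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class WriterToStringSketch {

    // Hypothetical stand-in for a Writer-based extraction call such as
    // stripper.writeText(doc, writer). It is NOT PDFBox code; it only
    // writes each given line followed by a newline.
    static void writeText(String[] lines, Writer out) throws IOException {
        for (String line : lines) {
            out.write(line);
            out.write('\n');
        }
    }

    // Collects everything written to the Writer in memory and returns it
    // as a String, so no file is created.
    static String extractToString(String[] lines) {
        StringWriter buffer = new StringWriter();
        try {
            writeText(lines, buffer);
        } catch (IOException e) {
            // The Writer API declares IOException, so it is rethrown unchecked here.
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String text = extractToString(new String[] {"first line", "second line"});
        System.out.print(text);
    }
}
```

Note that stripper.getText(document) in the snippet above already returns a String, so the StringWriter route would only matter if writeText were used instead.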
Attachments
Issue Links
- is part of
  - PDFBOX-1962 Refactor the packages in the core pdfbox module (Closed)
- relates to
  - PDFBOX-1182 Create a module for the commandline tools (Closed)
  - PDFBOX-1177 Create a module with examples instead having them in pdfbox.jar (Closed)
- requires
  - PDFBOX-812 Remove dependency on PageDrawer from text only operators (Closed)