[PDFBOX-269] ExtractText errors - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.8.0-incubator
Component/s: Text extraction
Labels: None

Description

[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1706491
Originally submitted by wrwessel on 2007-04-24 04:31.

I wrote a batch file to convert over 500 PowerPoint files to PDF (using DocumentConverter.py and OpenOffice); the batch file then uses ExtractText.exe to extract the text. Most of the files converted fine, but for 4 files ExtractText could not extract any text and threw various error messages. I have attached one of these as a sample. I am using version 0.7.4 from 19/5/07 and see the same problem with the 0.7.3 release. It is easy enough for me to convert the last 4 by hand, but this might be a bug you can fix.

Many thanks for the ExtractText program; it saved me a long time converting these by hand.

[attachment on SourceForge]
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1706491&file_id=226382
Sample.zip (application/x-zip-compressed), 216343 bytes

[comment on SourceForge]
Originally sent by benlitchfield.

I've looked at the attached PDF. Technically, I believe the root issue is that OpenOffice is not writing the PDF correctly. I have submitted the issue to the OpenOffice developers; it can be monitored at http://www.openoffice.org/issues/show_bug.cgi?id=76879

The issue is that the PDF is sometimes missing the 'endstream' keyword, which PDFBox looks for to tell it that the stream is done.

My rule of thumb is that if Acrobat can open it, then so should PDFBox, so this is still a 'bug' with PDFBox. Fixing this is possible but is not straightforward, so it may be a little bit before it is complete.
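
To illustrate the missing 'endstream' case described above, here is a minimal sketch of a tolerant end-of-stream scan. It is not the PDFBox parser and not the attached fix; the class and method names (EndstreamScan, findStreamEnd) are hypothetical, and the fallback to 'endobj' is an assumed recovery strategy. It treats whichever of 'endstream' or 'endobj' appears first as the end of the stream data.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: locates the end of stream data
// even when the writer omitted the 'endstream' keyword.
final class EndstreamScan {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.ISO_8859_1);

    // Returns the index of the first occurrence of pattern at or after from, or -1.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Returns the offset at which the stream data is assumed to end:
    // the first 'endstream' if it precedes any 'endobj', otherwise the
    // first 'endobj' (assumed fallback), otherwise the end of the input.
    // Caveat: binary stream data may contain these byte sequences by
    // chance, so a real parser would need further checks.
    static int findStreamEnd(byte[] data, int start) {
        int endStream = indexOf(data, ENDSTREAM, start);
        int endObj = indexOf(data, ENDOBJ, start);
        if (endStream >= 0 && (endObj < 0 || endStream < endObj)) {
            return endStream;
        }
        if (endObj >= 0) {
            return endObj;
        }
        return data.length;
    }

    public static void main(String[] args) {
        // A stream body whose 'endstream' keyword is missing.
        byte[] broken = "ABC\nendobj\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findStreamEnd(broken, 0));
    }
}
```

Choosing the earlier of the two keywords keeps the scan from running past the current object into a later object's 'endstream'.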

Attachments

endstream_missing_fix.diff
22/Apr/09 20:33
6 kB
Justin LeFebvre

People

Assignee: Unassigned
Reporter: Jukka Zitting
Votes: 0
Watchers: 0

Dates

Created: 24/Apr/07 11:31
Updated: 04/Aug/14 20:39
Resolved: 22/Apr/09 21:24