Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Description
[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1706491
Originally submitted by wrwessel on 2007-04-24 04:31.
I wrote a batch file to convert over 500 PowerPoint files to PDF (using DocumentConverter.py and OpenOffice); the batch file then uses ExtractText.exe to extract the text. Most of these files converted fine, but for 4 files ExtractText could not get any text and threw various error messages. I have attached one of these as a sample. I am using version 0.7.4 from 19/5/07 and see the same problem with the 0.7.3 release. It is easy enough for me to convert the last 4 by hand, but this might be a bug you can fix.
Many thanks for the ExtractText program; it saved me a long time converting these by hand.
[attachment on SourceForge]
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1706491&file_id=226382
Sample.zip (application/x-zip-compressed), 216343 bytes
[comment on SourceForge]
Originally sent by benlitchfield.
Logged In: YES
user_id=601708
Originator: NO
I've looked at the attached PDF. Technically, I believe the root issue is that OpenOffice is not writing the PDF correctly. I have submitted the issue to the OpenOffice team, and it can be monitored at http://www.openoffice.org/issues/show_bug.cgi?id=76879
The issue is that the PDF is sometimes missing the 'endstream' keyword, which PDFBox looks for to tell it that the stream is done.
My rule of thumb is that if Acrobat can open it, then so should PDFBox, so this is still a 'bug' with PDFBox. Fixing this is possible but is not straightforward, so it may be a little bit before it is complete.
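To illustrate the kind of lenient parsing described above, here is a minimal Python sketch of one possible fallback: if 'endstream' is missing, treat the next 'endobj' as the end of the stream. This is not the PDFBox fix and not PDFBox code. The function names `find_stream_end` and `extract_streams` are hypothetical, and the sketch ignores the `/Length` entry and the possibility that binary stream data happens to contain these keywords.

```python
import re

# Matches the 'stream' keyword followed by a line break, but not 'endstream'.
STREAM_KEYWORD = re.compile(rb"(?<!end)stream\r?\n")


def find_stream_end(data: bytes, start: int) -> int:
    """Return the offset at which the stream data beginning at `start` ends.

    Prefers the 'endstream' keyword. If it is missing before the next
    'endobj', falls back to 'endobj' (assumption: the writer dropped
    'endstream' but still closed the object).
    """
    endstream = data.find(b"endstream", start)
    endobj = data.find(b"endobj", start)
    if endstream != -1 and (endobj == -1 or endstream < endobj):
        return endstream
    if endobj != -1:
        return endobj
    # Neither keyword found: take everything up to the end of the input.
    return len(data)


def extract_streams(data: bytes) -> list:
    """Collect the raw bytes of every stream found in `data`."""
    streams = []
    pos = 0
    while True:
        match = STREAM_KEYWORD.search(data, pos)
        if match is None:
            break
        start = match.end()
        end = find_stream_end(data, start)
        # Simplification: drop the trailing line break before the keyword.
        streams.append(data[start:end].rstrip(b"\r\n"))
        pos = end
    return streams


# Second object is missing 'endstream', as in the reported files.
sample = (
    b"1 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
    b"2 0 obj\n<< /Length 5 >>\nstream\nworld\nendobj\n"
)
streams = extract_streams(sample)
```

With the sample above, both streams are recovered even though the second object has no 'endstream'. A real parser would more likely trust `/Length` first and use keyword scanning only as a recovery path.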