Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
The academic repository JSTOR makes papers available as PDFs. The PDFs report the following origin information:
Content creator: JstorPdfGenerator v1.0
PDF producer: iText 2.0.6 (by lowagie.com)
These PDFs open fine in Acrobat, Preview, Foxit, etc., but PDFBox throws an exception when loading them:
Exception in thread "main" java.io.IOException: Error: Expected to read '%%EOF' instead started reading '1'
    at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1005)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:456)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:739)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:706)
    at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:393)
    at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:369)
    at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:355)
I traced through the code, and it appears that PDFBox rejects these because they contain a 'startxref' that is not followed by a %%EOF two lines later:
...
startxref
613364
1 0 obj
...
Here's a small patch that will accept files that are missing the EOF after the startxref:
Index: src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java
===================================================================
--- src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java (revision 802578)
+++ src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java (working copy)
@@ -453,11 +453,9 @@
                 {
                     parseStartXref();
                     //verify that EOF exists
-                    String eof = readExpectedString( "%%EOF" );
-                    if( eof.indexOf( "%%EOF" )== -1 && !pdfSource.isEOF() )
-                    {
-                        throw new IOException( "expected='%%EOF' actual='" + eof + "' next=" + readString() +
-                            " next=" +readString() );
+                    int c = pdfSource.peek();
+                    if (c == '%') {
+                        readExpectedString("%%EOF");
                     }
                     isEndOfFile = true;
                 }
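For anyone who wants to see the intended behaviour without the PDFBox source at hand, here is a minimal standalone sketch of the same idea: after the startxref offset, consume '%%EOF' only if the next non-whitespace byte is '%', and otherwise leave the stream untouched so parsing can continue with the next object. This is not PDFBox code. The class LenientEofCheck and its methods are hypothetical names made up for illustration, and the whitespace skipping is an assumption of this sketch rather than something the patch does explicitly.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of PDFBox.
public class LenientEofCheck {

    /**
     * Call with the stream positioned just after the startxref offset.
     * Returns true if a %%EOF marker was present and consumed, false if
     * it was absent (the first non-whitespace byte is pushed back).
     */
    static boolean consumeOptionalEof(PushbackInputStream in) throws IOException {
        int c = in.read();
        // Assumption of this sketch: skip line breaks and blanks first.
        while (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
            c = in.read();
        }
        if (c == -1) {
            return false; // physical end of input, no marker
        }
        if (c != '%') {
            in.unread(c); // e.g. the '1' of "1 0 obj": leave it for the caller
            return false;
        }
        byte[] marker = "%%EOF".getBytes(StandardCharsets.US_ASCII);
        // The first '%' is already matched; verify the remaining bytes.
        for (int i = 1; i < marker.length; i++) {
            if (in.read() != marker[i]) {
                throw new IOException("Expected '%%EOF' after startxref offset");
            }
        }
        return true;
    }

    /** Convenience wrapper for trying the check on a string. */
    static boolean hasEofMarker(String afterOffset) {
        byte[] bytes = afterOffset.getBytes(StandardCharsets.US_ASCII);
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(bytes))) {
            return consumeOptionalEof(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasEofMarker("\n%%EOF\n"));   // well-formed tail
        System.out.println(hasEofMarker("\n1 0 obj\n")); // JSTOR-style tail
    }
}
```

With this shape, the JSTOR-style input (an object starting right after the offset) is reported as "no marker" instead of raising an exception, while a truncated or misspelled marker that does begin with '%' is still rejected.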
Issue Links
- is duplicated by: PDFBOX-311 Expected to read '%%EOF' instead started reading 'e' (Closed)
- is related to: PDFBOX-802 Better handle corrupt/missing %%EOF flags at the end of a file (Closed)