  1. PDFBox
  2. PDFBOX-506

PDFBox can't parse PDF documents from jstor.org

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.1
    • Component/s: None
    • Labels: None

    Description

      The academic repository JSTOR makes papers available as PDFs. The PDFs report the following origin metadata:
      Content creator: JstorPdfGenerator v1.0
      PDF producer: iText 2.0.6 (by lowagie.com)

      These PDFs open fine in Acrobat, Preview, Foxit, etc., but PDFBox throws an exception when loading them:

      Exception in thread "main" java.io.IOException: Error: Expected to read '%%EOF' instead started reading '1'
      at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1005)
      at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:456)
      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:739)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:706)
      at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:393)
      at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:369)
      at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:355)

      I traced through the code, and it appears that PDFBox rejects these files because they contain a 'startxref' whose offset line is not followed by '%%EOF'. A new object starts there instead:

      ...
      startxref
      613364
      1 0 obj
      ...
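
      To check whether a given file has this layout before handing it to the parser, a scan along the following lines may help. This is only a rough sketch and not part of PDFBox: the class and method names are made up for illustration, and it assumes the whole file fits in memory and that the literal text "startxref" does not occur inside stream data.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a PDFBox class.
final class StartXrefCheck {
    private static final String KEYWORD = "startxref";

    /** Returns true if some "startxref <offset>" is not followed by "%%EOF". */
    static boolean hasStartXrefWithoutEof(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so indices stay aligned.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int pos = text.indexOf(KEYWORD);
        while (pos >= 0) {
            int i = skipWhitespace(text, pos + KEYWORD.length());
            // Skip the decimal byte offset that follows the keyword.
            while (i < text.length() && Character.isDigit(text.charAt(i))) {
                i++;
            }
            i = skipWhitespace(text, i);
            if (!text.startsWith("%%EOF", i)) {
                return true; // e.g. "1 0 obj" follows directly, as in the JSTOR files
            }
            pos = text.indexOf(KEYWORD, i);
        }
        return false;
    }

    private static int skipWhitespace(String text, int i) {
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        return i;
    }
}
```

      For example, the bytes of a file could be read with java.nio.file.Files.readAllBytes and passed to hasStartXrefWithoutEof.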

      Here's a small patch that accepts files that are missing the '%%EOF' marker after the 'startxref':

      Index: src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java
      ===================================================================
      --- src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java (revision 802578)
      +++ src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java (working copy)
      @@ -453,11 +453,9 @@
       {
           parseStartXref();
           //verify that EOF exists
      -    String eof = readExpectedString( "%%EOF" );
      -    if( eof.indexOf( "%%EOF" )== -1 && !pdfSource.isEOF() )
      -    {
      -        throw new IOException( "expected='%%EOF' actual='" + eof + "' next=" + readString() +
      -            " next=" +readString() );
      -    }
      +    int c = pdfSource.peek();
      +    if (c == '%') {
      +        readExpectedString("%%EOF");
      +    }
           isEndOfFile = true;
       }
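
      The idea behind the patch, peeking at the next byte and only insisting on '%%EOF' when a '%' is actually there, can be tried outside PDFBox with a plain java.io.PushbackInputStream. The sketch below is a stand-alone illustration, not the PDFBox implementation: the class and method names are hypothetical, and it assumes the stream is already positioned on the first non-whitespace byte after the startxref offset.

```java
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone illustration, not a PDFBox class.
final class LenientEofCheck {
    private static final byte[] EOF_MARKER =
            "%%EOF".getBytes(StandardCharsets.ISO_8859_1);

    /** Returns the next byte without consuming it, or -1 at end of stream. */
    static int peek(PushbackInputStream in) throws IOException {
        int c = in.read();
        if (c != -1) {
            in.unread(c);
        }
        return c;
    }

    /**
     * Consumes "%%EOF" only when the next byte is '%'.
     * Returns true if the marker was read, false if it is absent,
     * in which case the stream is left untouched.
     */
    static boolean consumeOptionalEof(PushbackInputStream in) throws IOException {
        if (peek(in) != '%') {
            return false; // e.g. "1 0 obj" follows: tolerate the missing marker
        }
        for (byte expected : EOF_MARKER) {
            int actual = in.read();
            if (actual != expected) {
                throw new IOException("Expected '%%EOF' but found byte " + actual);
            }
        }
        return true;
    }
}
```

      As in the patch, a '%' that starts something other than '%%EOF' (a comment, for instance) is still reported as an error. Only the case where the marker is missing entirely is tolerated.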

      Attachments

        1. siegel.pdf (665 kB, Dave Engberg)

            People

              Assignee: Adam Nichols (adamnichols)
              Reporter: Dave Engberg (engberg)
              Votes: 1
              Watchers: 1
