  1. PDFBox
  2. PDFBOX-978

unreading of trailing content after 'endobj' is missing new line byte (fix included)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.6.0
    • Component/s: Parsing
    • Labels:
      None

      Description

      I have several journal PDFs where the last xref section starts like

      endobj xref
      0 92
      0000000000 65535 f
      0000000044 00000 n

      In these cases the PDF parser reads the endobj line completely and then unreads " xref".
      However, the newline (in this case ^D) that readline() consumed is lost. This is already
      documented in the method readline() within PDFParser:
      "Note: if you later unread the results of this function, you'll
      need to add a newline character to the end of the string."

      Currently I get an error like: "expected='obj' actual='655'" because the 'xref' is read as 'xref0'.

      The fix:
      In PDFParser, insert the following lines before line 579 (where the trailing characters after 'endobj' are unread):

      // add a space first in place of the newline consumed by readline()
      pdfSource.unread( SPACE_BYTE );

      Thus we get:

      if (endObjectKey.startsWith( "endobj" ) )
      {
          /*
           * Some PDF files don't contain a new line after endobj so we
           * need to make sure that the next object number is getting read separately
           * and not part of the endobj keyword. Ex. Some files would have "endobj28"
           * instead of "endobj"
           */
          // add a space first in place of the newline consumed by readline()
          pdfSource.unread( SPACE_BYTE );
          pdfSource.unread( endObjectKey.substring( 6 ).getBytes("ISO-8859-1") );
      }
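The effect of the missing separator can be reproduced outside PDFBox. The following is a minimal sketch using only java.io.PushbackInputStream. The class UnreadNewlineDemo and the helpers readLine, readToken and tokenAfterEndobj are hypothetical names made up for this illustration, not PDFBox code, and the sample input uses '\n' as the line terminator for simplicity. It shows that unreading the trailing " xref" without a separator fuses it with the following "0", while unreading a space first keeps the tokens apart.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone demo; it does not use any PDFBox classes.
class UnreadNewlineDemo {

    /** Reads up to the next '\n' and consumes it; the newline is not returned. */
    static String readLine(PushbackInputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            sb.append((char) c);
        }
        return sb.toString();
    }

    /** Skips leading whitespace, then reads one whitespace-delimited token. */
    static String readToken(PushbackInputStream in) throws IOException {
        int c;
        while ((c = in.read()) != -1 && Character.isWhitespace(c)) {
            // skip leading whitespace
        }
        StringBuilder sb = new StringBuilder();
        while (c != -1 && !Character.isWhitespace(c)) {
            sb.append((char) c);
            c = in.read();
        }
        return sb.toString();
    }

    /** Returns the first token seen after the content following "endobj" is unread. */
    static String tokenAfterEndobj(boolean restoreSeparator) throws IOException {
        byte[] data = "endobj xref\n0 92\n".getBytes(StandardCharsets.ISO_8859_1);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data), 64);

        String line = readLine(in); // "endobj xref"; the newline is consumed
        if (restoreSeparator) {
            // Unread the separator first: the bytes unread last are read first.
            in.unread(' ');
        }
        in.unread(line.substring(6).getBytes(StandardCharsets.ISO_8859_1));
        return readToken(in);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenAfterEndobj(false)); // prints "xref0"
        System.out.println(tokenAfterEndobj(true));  // prints "xref"
    }
}
```

The order of the two unread calls matters: a pushback stream returns the most recently unread bytes first, so the separator has to be pushed back before the trailing content, as in the proposed fix.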

              People

              • Assignee: Adam Nichols (adamnichols)
              • Reporter: Timo Boehme (tboehme)

                  Time Tracking

                  • Original Estimate: 5m
                  • Remaining Estimate: 5m
                  • Time Spent: Not Specified