PDFBox / PDFBOX-798

Better handle out of spec PDFs


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.1
    • Component/s: Parsing
    • Labels: None
    • Environment: 32-bit Windows Vista, Java 1.5, HEAD tag of PDFBox

    Description

      I came across another out-of-spec issue which causes PDFBox to crash. Here's the object:
      5 0 obj
      <</Type /Page
      /Parent 6 0 R
      /MediaBox [ 0 0 610.560 783.360
      endstream
      endobj

      There are several problems here. The MediaBox array has no closing right square bracket, there is no ">>" to end the dictionary, and there is an "endstream" stuck in there for no apparent reason. I found this file in the wild, but I do not know whether it came from a bug in the creating program, data corruption, or something else. I do know that Adobe Reader parses it without crashing. Since this is not a conforming PDF, the result is undefined, so crashing (which is what PDFBox eventually does when it tries to process the next object in the file) is a perfectly acceptable thing to do.

      However, I'd like PDFBox to detect that the array is complete when it sees "endstream", ignore the rogue "endstream", and then recognize that the object has ended when it sees "endobj". I'm going to go one step further and also accept the same object even if "endstream" or "endobj" is missing. In addition to the object above, I also tested with these objects:

      % endobj, without the endstream
      5 0 obj
      <</Type /Page
      /Parent 6 0 R
      /MediaBox [ 0 0 610.560 783.360
      endobj

      % endstream, without the endobj
      5 0 obj
      <</Type /Page
      /Parent 6 0 R
      /MediaBox [ 0 0 610.560 783.360
      endstream

      % properly ended array, dictionary and object (aka conforming PDF)
      5 0 obj
      <</Type /Page
      /Parent 6 0 R
      /MediaBox [ 0 0 610.560 783.360 ]
      >>
      endobj
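
      The array-termination rule described above can be sketched roughly as follows. This is not the attached patch and not the actual BaseParser code; the class and method names (LenientArraySketch, tokenize, readArray, mediaBox) are hypothetical, and the whitespace-based tokenizer is a simplifying assumption that ignores comments, strings and streams. It only illustrates treating "endstream" or "endobj" as an implicit end of an unclosed array.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not PDFBox's BaseParser.
final class LenientArraySketch {

    /** Splits raw object text into tokens; "[", "]", "<<" and ">>" become separate tokens. */
    static List<String> tokenize(String text) {
        String spaced = text.replace("<<", " << ").replace(">>", " >> ")
                            .replace("[", " [ ").replace("]", " ] ");
        List<String> tokens = new ArrayList<String>();
        for (String t : spaced.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** True for keywords that cannot legally appear inside an array. */
    static boolean isObjectTerminator(String token) {
        return token.equals("endstream") || token.equals("endobj");
    }

    /**
     * Collects array elements starting just after "[". Stops at "]", at a
     * terminator keyword, or at end of input, whichever comes first.
     */
    static List<String> readArray(List<String> tokens, int start) {
        List<String> elements = new ArrayList<String>();
        for (int i = start; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("]") || isObjectTerminator(t)) {
                break;
            }
            elements.add(t);
        }
        return elements;
    }

    /** Returns the elements of the first array in the object text, or an empty list. */
    static List<String> mediaBox(String objectText) {
        List<String> tokens = tokenize(objectText);
        int open = tokens.indexOf("[");
        return open < 0 ? new ArrayList<String>() : readArray(tokens, open + 1);
    }
}
```

      With this sketch, the malformed object and the conforming one should yield the same four MediaBox values, because the array reader stops at whichever of "]", "endstream" or "endobj" it meets first.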

      Although this change only affects PDFs that do not conform to the spec, I want to put the patch up for review before committing it to SVN, since it modifies BaseParser.java. If I do not hear any objections or concerns in the next few days, I'll go ahead and commit it.

          People

            adamnichols Adam Nichols