PDFBox / PDFBOX-1299

BaseParser.readUntilEndOfStream can stop too early, causing IOException on valid PDFs

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.7.0
    • Component/s: None
    • Labels: None

    Description

      The purpose of BaseParser.readUntilEndOfStream is to scan ahead,
      copying bytes to the output, stopping once it sees "endstream".

      The problem with this approach is that the stream data itself
      sometimes contains "endstream", causing readUntilEndOfStream to
      stop too early.

      This can legitimately happen when the stream is an embedded PDF; I'll
      attach a test PDF showing this.

      However, the stream dict declares the stream length (in bytes), so
      it seems like we should respect that length (if present) and simply
      copy over that many bytes, instead of scanning the stream bytes for
      "endstream". This should be a lot faster too.

      I imagine we always scan so that we are more robust if the length is
      missing/invalid? Is that why this method was used? (I don't know the
      history here...). If so, maybe we can have an option to use
      the declared stream length if present.

      I have a patch that uses the declared stream length (if present);
      with it, at least this test PDF parses correctly.
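
      To illustrate the idea, here is a minimal, self-contained sketch.
      It is not the attached patch and does not use PDFBox's API: the
      class StreamBodySketch and the method readStreamBody are
      hypothetical names, the input is assumed to be a byte array
      already in memory, and a negative declaredLength is assumed to
      mean "no usable length in the stream dict". When a length is
      available, that many bytes are copied without being inspected;
      otherwise the code falls back to scanning for "endstream".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not PDFBox code.
public class StreamBodySketch {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /**
     * Returns the stream body that begins at offset start in data.
     * A negative declaredLength (an assumption of this sketch) means the
     * stream dict gave no usable length.
     */
    public static byte[] readStreamBody(byte[] data, int start, int declaredLength) {
        if (declaredLength >= 0 && (long) start + declaredLength <= data.length) {
            // Trust the declared length: copy that many bytes without
            // looking at them, so an embedded "endstream" is harmless.
            return Arrays.copyOfRange(data, start, start + declaredLength);
        }
        // Fallback for a missing or out-of-range length: stop at the
        // first "endstream" keyword, as the scanning approach does.
        int end = indexOf(data, ENDSTREAM, start);
        if (end < 0) {
            throw new IllegalArgumentException("no endstream keyword found");
        }
        return Arrays.copyOfRange(data, start, end);
    }

    // Plain byte-wise search for needle in haystack, starting at from.
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        for (int i = from; i + needle.length <= haystack.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }
}
```

      With a body that itself contains "endstream" (as an embedded PDF
      would), the length-based path returns the whole body, while the
      scanning fallback stops at the embedded keyword. The sketch only
      range-checks the declared length; a real parser would likely also
      want to confirm that "endstream" follows the copied bytes before
      trusting it.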

      Attachments

        1. PDFBOX-1299.patch
          2 kB
          Michael McCandless
        2. Tracey_Prather_31-Dec-2010_211843_2011Portfolio.pdf
          6.89 MB
          Michael McCandless

          People

            Assignee: Timo Boehme (tboehme)
            Reporter: Michael McCandless (mikemccand)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: