PDFBox / PDFBOX-2016

Stream parsing still incorrect if length value is wrong

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 1.8.4
    • Fix Version/s: 1.8.5, 2.0.0
    • Component/s: Parsing
    • Labels: None

      Description

      From issue PDFBOX-1333 - "In 1.7.0 stream parsing in BaseParser was optimized to use length value if available. The advantage is faster parsing and independence of 'endstream' bytes sequences in stream. However the disadvantage is that streams with wrong length values cannot be parsed anymore" - etc.

      That issue was marked as fixed once COSStreams could again be parsed by reading all the way to 'endstream'. However, the resulting COSStream object still holds the expected (declared) length, not the true length. When the COSStream is parsed with a PDFStreamParser, the call to COSStream#getUnfilteredStream uses getLength() instead of getLengthWritten() to limit the amount of data that can be read. This can truncate the stream, so incorrect length values still lead to missing data, which limits the usefulness of the earlier fix. Changing the call to getLengthWritten() should solve the problem.
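
      The truncation can be illustrated with a minimal, self-contained sketch. It does not use PDFBox itself: the class StreamLengthSketch, the method readBounded, and the sample content are hypothetical names chosen for illustration, and the two limits only stand in for what getLength() and getLengthWritten() would report under the behaviour described above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a length-bounded view of a stream body; not PDFBox code.
final class StreamLengthSketch {

    /** Reads at most 'limit' bytes of 'data', the way a length-limited input stream would. */
    static byte[] readBounded(byte[] data, long limit) throws IOException {
        // Clamp the limit to [0, data.length] so an over-long or negative length cannot overrun.
        int bound = (int) Math.max(0, Math.min(limit, data.length));
        try (InputStream in = new ByteArrayInputStream(data, 0, bound)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[64];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        // Bytes actually recovered by scanning up to 'endstream' (assumed sample content).
        byte[] body = "BT /F1 12 Tf (Hello World) Tj ET".getBytes(StandardCharsets.US_ASCII);

        long declaredLength = 10;          // assumed wrong /Length value from the file
        long lengthWritten = body.length;  // number of bytes really stored

        // Limiting by the declared length silently drops the tail of the content stream.
        byte[] truncated = readBounded(body, declaredLength);
        // Limiting by the written length returns everything that was parsed.
        byte[] complete = readBounded(body, lengthWritten);

        System.out.println("declared limit: " + truncated.length + " of " + body.length + " bytes");
        System.out.println("written limit:  " + complete.length + " of " + body.length + " bytes");
    }
}
```

      In this sketch the only difference between the two reads is which length is used as the bound, which mirrors the one-call change proposed above.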

        Attachments

        1. Hello.pdf
          0.7 kB
          Andrew Olsen
        2. Hello_broken.pdf
          0.7 kB
          Andrew Olsen

              People

              • Assignee: Tilman Hausherr
              • Reporter: Andrew Olsen
              • Votes: 0
              • Watchers: 3

                  Time Tracking

                  • Estimated: 2h
                  • Remaining: 2h
                  • Logged: Not Specified