  1. PDFBox
  2. PDFBOX-5152

Content Stream Appears Truncated in Specific File

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Bug
    • Affects Version/s: 2.0.23
    • Fix Version/s: None
    • Component/s: Parsing
    • Labels:
      None

      Description

      I'm working on a utility to invert the colors of a PDF file. An issue raised against it included a PDF file that, when parsed by PDFBox, appears to yield a truncated content stream. That is, running the following code produces a substantially shorter content stream than I would expect:

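      // Requires java.io.File, java.nio.charset.StandardCharsets,
      // org.apache.pdfbox.io.IOUtils, org.apache.pdfbox.pdmodel.PDDocument and PDPage.
      try (PDDocument doc = PDDocument.load(new File("January.pdf"))) {
        for (PDPage page : doc.getPages()) {
          // Read the page's content stream bytes and print them as text.
          String stream = new String(IOUtils.toByteArray(page.getContents()), StandardCharsets.UTF_8);
          System.out.println(stream);
        }
      }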
      

      The code outputs the following:

      q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q 
      

      I admit I don't have a strong understanding of PDF content streams, but I'm fairly confident that more than this is required to draw page 1 of the PDF.

      Additionally, the linked issue suggests that PDFBox internally refers to additional data that isn't contained in the content stream returned from page.getContents().
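
One possible reading of the output above: in PDF content streams the Do operator paints an external object (XObject) named by its operand, so `/Fm0 Do` may be handing the actual drawing to an object stored outside the page's own stream. The sketch below lists such names using only the Java standard library. It assumes tokens are separated by whitespace, which holds for the stream printed above but not for content streams in general. `DoOperands` and `paintedXObjects` are hypothetical names, not part of the PDFBox API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not PDFBox API: finds name operands that precede "Do".
final class DoOperands {

    // Returns each name (without the leading '/') that is immediately followed by Do.
    // Assumption: tokens are whitespace-separated, as in the short stream above.
    static List<String> paintedXObjects(String stream) {
        List<String> names = new ArrayList<>();
        String[] tokens = stream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i].equals("Do") && tokens[i - 1].startsWith("/")) {
                names.add(tokens[i - 1].substring(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // The stream printed by the code in this report.
        System.out.println(paintedXObjects("q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q "));
    }
}
```

If the list is non-empty, the page's visible content may live in the named objects rather than in the page stream itself; checking that would require looking the names up in the page's resources.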

      In my program, I need to find specific substrings in the content stream to locate specific operations and their arguments. To do so, I wrap each call to PDFStreamParser.parseNextToken() with queries to PDFStreamParser.seqSource.getPosition(). This gives me the bounds of each token in the content stream without my having to parse it myself, since parseNextToken does the work for me. However, the bounds these queries report extend beyond the length of the content stream returned by page.getContents().
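
The bound-tracking idea described above can be sketched without PDFBox. The example below is a simplified stand-in, not the reporter's code and not PDFBox's parser: it reads the position before and after each token and records inclusive, 0-based bounds. It splits on whitespace only, so strings, arrays, dictionaries and inline images are not handled. `TokenBounds`, `Span` and `scan` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for wrapping a tokenizer with position queries.
// NOT PDFBox's parser: tokens are delimited by whitespace only.
final class TokenBounds {

    // A token together with its inclusive, 0-based bounds in the source text.
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    static List<Span> scan(String stream) {
        List<Span> spans = new ArrayList<>();
        int pos = 0;
        while (pos < stream.length()) {
            // Skip whitespace between tokens.
            while (pos < stream.length() && Character.isWhitespace(stream.charAt(pos))) {
                pos++;
            }
            if (pos >= stream.length()) {
                break;
            }
            int start = pos; // position queried before the token is read
            while (pos < stream.length() && !Character.isWhitespace(stream.charAt(pos))) {
                pos++;
            }
            // pos is now one past the token, so the inclusive end is pos - 1.
            spans.add(new Span(stream.substring(start, pos), start, pos - 1));
        }
        return spans;
    }

    public static void main(String[] args) {
        for (Span s : scan("q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q ")) {
            System.out.println(s.start + "-" + s.end + " " + s.text);
        }
    }
}
```

With a tokenizer that reads from the same text it reports positions for, every recorded end stays below the text's length. Bounds that exceed the length therefore suggest the positions refer to a different stream than the one being compared against.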

      Specifically, one set of these bounds is (19, 313), inclusive. In other words, the token parsed by parseNextToken corresponds to characters 19 through 313 (inclusive, 0-based index) of the content stream. However, the content stream returned by page.getContents() is shorter than 313 characters.
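
The mismatch can be stated as a simple check: an inclusive, 0-based range ending at index 313 needs a stream of at least 314 characters. A minimal sketch, using the stream printed earlier in this report as input; `BoundsCheck` and `fits` are hypothetical names.

```java
// Hypothetical helper: checks whether inclusive, 0-based bounds fit in a stream.
final class BoundsCheck {

    // True if the inclusive range [start, end] lies inside a stream of the given length.
    static boolean fits(int start, int end, int length) {
        return start >= 0 && start <= end && end < length;
    }

    public static void main(String[] args) {
        String stream = "q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q ";
        // The reported bounds (19, 313) do not fit in this stream: prints false.
        System.out.println(fits(19, 313, stream.length()));
    }
}
```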

      Hopefully someone can shed some light on this issue for me. Thanks!

            People

            • Assignee:
              Unassigned
              Reporter:
              acid1103 Steven Fontaine
