PDFBox / PDFBOX-1042

Wrong XRefStream order while parsing incrementally updated PDFs with XRefStreams

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: Parsing
    • Labels: None

    Description

      A PDF can contain two types of XRef entries: XRefTables and XRefStreams.
      Most files use XRefTables for object references.

      Web-optimized (linearized) PDF documents use XRefStreams. An XRefStream is a compressed XRefTable stored as a stream object. The PDFParser parses these objects the same way as any other object and puts them into an object pool (a HashMap). If the document was incrementally updated, it contains several XRefStreams, and all of them end up in the object pool.

      The XRefStreamParser then parses the XRefStreams by fetching all XRefStream objects from that pool. The pool does not return the objects in the order in which they were read. As a result, an older XRefStream can overwrite entries from a newer one, so PDFBox fails to find the current objects and uses the older ones instead.
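
The ordering problem can be illustrated with a minimal, self-contained sketch. This is not PDFBox code: `XRefOrderSketch`, `XRefSection`, `merge` and `inFileOrder` are hypothetical names invented for illustration, and the offsets are made up. The sketch assumes that each incremental update is appended to the file, so a later file offset means a newer XRefStream; a real parser would more likely follow the /Prev chain starting from the last startxref.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class XRefOrderSketch {

    /** Hypothetical stand-in for one parsed XRefStream. */
    static final class XRefSection {
        final long fileOffset;             // byte position of the XRefStream in the file
        final Map<Integer, Long> entries;  // object number -> byte offset of that object

        XRefSection(long fileOffset, Map<Integer, Long> entries) {
            this.fileOffset = fileOffset;
            this.entries = entries;
        }
    }

    /** Applies the sections in iteration order; later sections overwrite earlier ones. */
    static Map<Integer, Long> merge(Iterable<XRefSection> sections) {
        Map<Integer, Long> table = new HashMap<>();
        for (XRefSection section : sections) {
            table.putAll(section.entries);
        }
        return table;
    }

    /** Returns the sections sorted oldest-first, assuming updates are appended to the file. */
    static List<XRefSection> inFileOrder(Iterable<XRefSection> sections) {
        List<XRefSection> sorted = new ArrayList<>();
        for (XRefSection section : sections) {
            sorted.add(section);
        }
        sorted.sort(Comparator.comparingLong(s -> s.fileOffset));
        return sorted;
    }

    public static void main(String[] args) {
        // Object 1 plays the role of the document catalog (made-up offsets).
        XRefSection original = new XRefSection(1000L, Collections.singletonMap(1, 15L));
        XRefSection update = new XRefSection(5000L, Collections.singletonMap(1, 4200L));

        // Simulate a pool that happens to return the newer section first.
        List<XRefSection> fromPool = Arrays.asList(update, original);

        System.out.println("pool order: " + merge(fromPool).get(1));              // 15 (stale catalog)
        System.out.println("file order: " + merge(inFileOrder(fromPool)).get(1)); // 4200 (updated catalog)
    }
}
```

Merging in whatever order the pool yields lets the original section win, which matches the stale-catalog symptom; sorting before merging restores the updated entry.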

      If a user tries to parse such a document, the result is an indeterminate state in which older and newer objects are mixed.

      In my case, the document catalog was overwritten by an older one, so I could not see the changes made by the incremental update.

      A patch and a sample pdf will come soon.

            People

              Assignee: Unassigned
              Reporter: Thomas Chojecki (tchojecki)