  1. PDFBox
  2. PDFBOX-5683

Inconsistent/incomplete PDF rendering

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.29, 3.0.0 PDFBox, 4.0.0
    • Fix Version/s: 2.0.30, 3.0.1 PDFBox, 4.0.0
    • Component/s: Parsing

    Description

      We have integrated Tika and its default parsers into a forensic tool (IPED). As a forensic tool, it tries to recover (carve) deleted PDF files, some of which can only be partially recovered.

      PDFParser throws an exception if the parsed content has no PDF version header, which prevents any further parsing of the content.
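
To illustrate the header condition described above, here is a minimal plain-Java sketch. It is not PDFBox's actual check: the class name `PdfHeaderCheck` and the 1024-byte search window are assumptions made for this example. It only shows how a tool could tell an intact file from a carved fragment that lacks the `%PDF-` marker.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of PDFBox: looks for a "%PDF-" marker near the start of a buffer. */
final class PdfHeaderCheck {
    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // Assumption: the size of the search window is an illustrative choice.
    private static final int SEARCH_LIMIT = 1024;

    /** Returns the offset of a "%PDF-" marker lying entirely within the first SEARCH_LIMIT bytes, or -1. */
    static int findHeader(byte[] data) {
        int last = Math.min(data.length, SEARCH_LIMIT) - MARKER.length;
        for (int i = 0; i <= last; i++) {
            boolean match = true;
            for (int j = 0; j < MARKER.length; j++) {
                if (data[i + j] != MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] intact = "%PDF-1.7\n1 0 obj".getBytes(StandardCharsets.US_ASCII);
        byte[] carved = "5 0 obj\n<< /Type /Page >>".getBytes(StandardCharsets.US_ASCII);
        System.out.println("intact: " + findHeader(intact));
        System.out.println("carved: " + findHeader(carved));
    }
}
```

A carved fragment for which this returns -1 is the kind of input that currently cannot be parsed at all.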

      With that exception commented out, the parser still throws the exception "Missing root object specification in trailer" in the initialParse method, as the root object is normally at the beginning of a PDF file.

      However, with little effort I was able to build a "fake" root COSDictionary and populate its PAGES entry with the recoverable pages, found by searching the objects listed in document.getXrefTable():

          protected void initialParse() throws IOException
          {
              COSDictionary trailer = retrieveTrailer();

              COSDictionary root = trailer.getCOSDictionary(COSName.ROOT);
              if (root == null)
              {
                  // originally: throw new IOException("Missing root object specification in trailer.");
                  // instead, rebuild a minimal catalog from the objects recovered via the xref table
                  root = new COSDictionary();
                  root.setItem(COSName.TYPE, COSName.CATALOG);
                  trailer.setItem(COSName.ROOT, root);

                  // collect every recovered dictionary of type /Page as a kid of the new page tree
                  COSArray kids = new COSArray();
                  for (COSObjectKey key : document.getXrefTable().keySet())
                  {
                      COSBase base = document.getObjectFromPool(key).getObject();
                      if (base instanceof COSDictionary
                              && COSName.PAGE.equals(((COSDictionary) base).getCOSName(COSName.TYPE)))
                      {
                          kids.add(base);
                      }
                  }

                  // build the /Pages node that references the recovered pages
                  COSDictionary pages = new COSDictionary();
                  pages.setItem(COSName.TYPE, COSName.PAGES);
                  pages.setItem(COSName.COUNT, COSInteger.get(kids.size()));
                  pages.setItem(COSName.KIDS, kids);
                  root.setItem(COSName.PAGES, pages);

                  document.setDecrypted();
                  initialParseDone = true;
                  return;
              }
              // in some pdfs the type value "Catalog" is missing in the root object
              if (isLenient() && !root.containsKey(COSName.TYPE))
              {
                  root.setItem(COSName.TYPE, COSName.CATALOG);
              }
              // check pages dictionaries
              checkPages(root);
              document.setDecrypted();
              initialParseDone = true;
          }
      

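The rebuild step can also be shown without PDFBox. The following is a simplified model, not the library's API: each PDF object is a plain `Map` instead of a `COSDictionary`, and the names `PageTreeRebuild`, `rebuildRoot` and `dict` are hypothetical. It demonstrates only the idea of scanning all recovered objects for `/Type /Page` and wrapping them in a synthetic catalog and page tree.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Simplified model (assumption): PDF dictionaries are represented as plain maps. */
final class PageTreeRebuild {

    /** Hypothetical helper that creates a dictionary holding only a Type entry. */
    static Map<String, Object> dict(String type) {
        Map<String, Object> d = new LinkedHashMap<>();
        d.put("Type", type);
        return d;
    }

    /** Builds a synthetic Catalog whose Pages node lists every recovered Page object. */
    static Map<String, Object> rebuildRoot(Map<Integer, Map<String, Object>> recoveredObjects) {
        List<Map<String, Object>> kids = new ArrayList<>();
        for (Map<String, Object> candidate : recoveredObjects.values()) {
            if ("Page".equals(candidate.get("Type"))) {
                kids.add(candidate);
            }
        }
        Map<String, Object> pages = dict("Pages");
        pages.put("Count", kids.size());
        pages.put("Kids", kids);

        Map<String, Object> root = dict("Catalog");
        root.put("Pages", pages);
        return root;
    }

    public static void main(String[] args) {
        // Keys stand in for object numbers; a TreeMap iterates them in ascending order.
        Map<Integer, Map<String, Object>> recovered = new TreeMap<>();
        recovered.put(4, dict("Page"));
        recovered.put(7, dict("Font"));
        recovered.put(9, dict("Page"));
        System.out.println(rebuildRoot(recovered));
    }
}
```

In this model the pages appear in the iteration order of the source map, which need not match the original page order, since that order was defined by the lost page tree.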
      This small change was enough to show the recoverable pages in PDFDebugger, export XMP metadata and text, and obtain the rendered pages as buffered images for use in my OCR module.

      So it would be very useful if PDFBox offered an optional, configurable mode to open and recover inconsistent or incomplete PDF files, implementing at least the approach above or further recovery actions.

      Attachments

        1. pdf1.pdf
          5.62 MB
          Patrick Dalla Bernardina

          People

            Assignee: lehmi Andreas Lehmkühler
            Reporter: patrickdalla Patrick Dalla Bernardina
            Votes: 0
            Watchers: 3

              Time Tracking

                Estimated: 72h
                Remaining: 72h
                Logged: Not Specified