  1. PDFBox
  2. PDFBOX-1639

Infinite loop with PDFParser used by Tika.


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.1, 1.8.2, 2.0.0
    • Fix Version/s: 1.8.3, 2.0.0
    • Component/s: Parsing
    • Labels: None

    Description

      Hi,

      I encountered an issue in a production environment that caused a disk-full error.
      Tika uses the PDFParser with the "forceParsing" boolean set to true so that parsing continues even if an error occurs.

      Two PDFs have an object number greater than the maximum int value, so the readInt() method fails.
      Because "forceParsing" is set, the parser tries to move on to the next object, but it can't: on error, the readInt method backtracks over the bytes it has read, so
      the "skipToNextObj" method does nothing and we try to parse the same object indefinitely...
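
A minimal sketch of the failure mode described above, not the actual PDFBox code. The class `ReadIntRewindSketch` and its methods are hypothetical names; the sketch only assumes a `readInt` that pushes back the digits it read when parsing fails, and a caller that retries without consuming anything. The retry loop is capped so the demo terminates, whereas the real parser kept going.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; this is not PDFBox source code. */
public class ReadIntRewindSketch {

    /** Reads ASCII digits as an int; on failure, pushes the digits back and throws. */
    static int readInt(PushbackInputStream in) throws IOException {
        StringBuilder digits = new StringBuilder();
        int c;
        while ((c = in.read()) >= '0' && c <= '9') {
            digits.append((char) c);
        }
        if (c != -1) {
            in.unread(c); // the first non-digit byte belongs to the next token
        }
        try {
            return Integer.parseInt(digits.toString());
        } catch (NumberFormatException e) {
            // Backtrack: the stream ends up exactly where it was before the call.
            in.unread(digits.toString().getBytes(StandardCharsets.ISO_8859_1));
            throw new IOException("Not an int: " + digits, e);
        }
    }

    /** Retries readInt() the way a lenient parser might; capped so the demo terminates. */
    static int failedAttempts(String data, int maxAttempts) {
        PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(data.getBytes(StandardCharsets.ISO_8859_1)), 64);
        int failures = 0;
        while (failures < maxAttempts) {
            try {
                readInt(in);
                break; // success: the parser can move on
            } catch (IOException e) {
                failures++; // no bytes were consumed, so the next attempt fails the same way
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(failedAttempts("4294967296 0 obj", 5)); // prints 5
        System.out.println(failedAttempts("12 0 obj", 5));         // prints 0
    }
}
```

With an object number of 4294967296 (one past the unsigned 32-bit range, and well above Integer.MAX_VALUE), every attempt fails at the same stream position, so only the cap stops the loop.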

      The COSObjectKey object already uses a long as the object number, so we should read a long instead of an integer during parsing, using a "readLong" method to handle object numbers that are too large for an int.
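
A rough sketch of what such a "readLong" could look like, under the assumption that it mirrors a digit-reading readInt over a `PushbackInputStream`. The class `ReadLongSketch` and the helper `parseObjectNumber` are hypothetical names used for illustration; the attached patches may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; this is not the attached patch. */
public class ReadLongSketch {

    /** Reads ASCII digits as a long; on failure, pushes the digits back and throws. */
    static long readLong(PushbackInputStream in) throws IOException {
        StringBuilder digits = new StringBuilder();
        int c;
        while ((c = in.read()) >= '0' && c <= '9') {
            digits.append((char) c);
        }
        if (c != -1) {
            in.unread(c); // the first non-digit byte belongs to the next token
        }
        try {
            return Long.parseLong(digits.toString());
        } catch (NumberFormatException e) {
            in.unread(digits.toString().getBytes(StandardCharsets.ISO_8859_1));
            throw new IOException("Not a long: " + digits, e);
        }
    }

    /** Parses the leading object number of an object header such as "12 0 obj". */
    static long parseObjectNumber(String header) {
        PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(header.getBytes(StandardCharsets.ISO_8859_1)), 64);
        try {
            return readLong(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseObjectNumber("4294967296 0 obj")); // prints 4294967296
    }
}
```

Reading a long only widens the accepted range: a number above Long.MAX_VALUE would still fail, so the retry path would also need to make progress on error to rule out the loop entirely.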

      Do you agree with that?

      BR,
      Eric

      Attachments

        1. PDFBox-1639-v2.patch
          15 kB
          Eric Leleu
        2. PDFBox-1639.patch
          13 kB
          Eric Leleu

          People

            Assignee: Eric Leleu (leleueri)
            Reporter: Eric Leleu (leleueri)
            Votes: 0
            Watchers: 4
