  1. PDFBox
  2. PDFBOX-1639

Infinite loop with PDFParser used by Tika.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.1, 1.8.2, 2.0.0
    • Fix Version/s: 1.8.3, 2.0.0
    • Component/s: Parsing
    • Labels:
      None

      Description

      Hi,

      I encountered an issue in a production environment that caused a disk-full error.
      Tika uses the PDFParser with the "forceParsing" boolean set to true so that parsing continues even if an error occurs.

      Two PDFs have an object number greater than the maximum int value, so the readInt() method fails.
      Because of the "forceParsing" boolean, the parser tries to move on to the next object, but it can't: on error, the readInt method backtracks over the bytes it has read, so
      the "skipToNextObj" method does nothing and we try to parse the same object indefinitely...
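
The stall described above can be reproduced in isolation. The following is a minimal sketch, not PDFBox code: the class `StalledParseSketch` and its `stalledAttempts` helper are hypothetical, and the behaviour of `skipToNextObj` (stopping as soon as the input looks like the start of an object header) is an assumption made for illustration. The retry loop is bounded here so that the example terminates.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the parser's input handling; not PDFBox code.
final class StalledParseSketch {
    private final byte[] data;
    private int pos = 0;

    StalledParseSketch(String text) {
        this.data = text.getBytes(StandardCharsets.US_ASCII);
    }

    int position() {
        return pos;
    }

    // Reads a run of ASCII digits as an int. On failure it rewinds to the
    // position it started from, mirroring the backtracking described above.
    int readInt() {
        int start = pos;
        while (pos < data.length && data[pos] >= '0' && data[pos] <= '9') {
            pos++;
        }
        String digits = new String(data, start, pos - start, StandardCharsets.US_ASCII);
        try {
            return Integer.parseInt(digits);
        } catch (NumberFormatException e) {
            pos = start; // backtrack over the bytes that were read
            throw e;
        }
    }

    // Assumption: the skip stops as soon as the input looks like the start of
    // an object header (a digit here), so it does nothing after a rewind.
    void skipToNextObj() {
        while (pos < data.length && !(data[pos] >= '0' && data[pos] <= '9')) {
            pos++;
        }
    }

    // Retries up to maxAttempts times and counts the attempts that failed
    // without moving the read position.
    int stalledAttempts(int maxAttempts) {
        int stalled = 0;
        for (int i = 0; i < maxAttempts; i++) {
            int before = pos;
            try {
                readInt();
            } catch (NumberFormatException e) {
                skipToNextObj();
                if (pos == before) {
                    stalled++;
                }
            }
        }
        return stalled;
    }

    public static void main(String[] args) {
        // 4294967296 is 2^32, which does not fit in an int.
        StalledParseSketch parser = new StalledParseSketch("4294967296 0 obj");
        System.out.println("stalled attempts: " + parser.stalledAttempts(5));
        System.out.println("position: " + parser.position());
    }
}
```

Every attempt fails at the same offset, so without the bound on the number of attempts the loop would never end.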

      The COSObjectKey object already uses a long as its object number, so we should read a long instead of an integer during parsing, using a "readLong" method to handle object numbers that are too large for an int.
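
As a rough illustration of the proposal, here is a sketch of what a "readLong" could look like over a `java.io.PushbackInputStream`. It is not the attached patch: the class `ReadLongSketch`, the helper `parseObjectNumber`, and the error message are hypothetical, and for brevity this version does not rewind the digits on failure.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the proposed readLong; not the attached patch.
final class ReadLongSketch {

    // Reads a run of ASCII digits and parses it as a long, so that object
    // numbers above Integer.MAX_VALUE are accepted.
    static long readLong(PushbackInputStream in) throws IOException {
        StringBuilder digits = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c >= '0' && c <= '9') {
            digits.append((char) c);
        }
        if (c != -1) {
            in.unread(c); // leave the terminating byte for the caller
        }
        try {
            return Long.parseLong(digits.toString());
        } catch (NumberFormatException e) {
            throw new IOException("Expected a long object number, got '" + digits + "'", e);
        }
    }

    // Convenience wrapper for the example: parses the leading object number
    // of an object header such as "4294967296 0 obj".
    static long parseObjectNumber(String header) {
        byte[] bytes = header.getBytes(StandardCharsets.US_ASCII);
        try (PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(bytes))) {
            return readLong(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long objectNumber = parseObjectNumber("4294967296 0 obj");
        System.out.println("object number: " + objectNumber);
        System.out.println("fits in int: " + (objectNumber <= Integer.MAX_VALUE));
    }
}
```

A number that overflows even a long would still fail, so the recovery path after an error would presumably also need to make progress past the bad object.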

      Do you agree with that?

      BR,
      Eric

        Attachments

        1. PDFBox-1639-v2.patch
          15 kB
          Eric Leleu
        2. PDFBox-1639.patch
          13 kB
          Eric Leleu

            People

            • Assignee:
              leleueri Eric Leleu
              Reporter:
              leleueri Eric Leleu