PDFBOX-5158: Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.0.0 PDFBox
    • Fix Version/s: 3.0.0 PDFBox
    • Component/s: None
    • Labels: None

    Description

      I found a number of files that fail with "read too many EOFs". That message comes from a safety check we now have in TikaInputStream, which flags parsers that read an EOF more than 1000 times, since that may be a sign of an infinite loop.

      When I turn off this safety check in TikaInputStream, I get an infinite loop.
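To make the safety check concrete, here is a minimal sketch of an EOF-counting guard. It is not the TikaInputStream code: the class name `EofCountingInputStream`, the `maxEofReads` parameter and the exception message are assumptions made up for illustration. It only shows the idea of failing fast once a caller keeps reading past the end of the stream.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical guard stream: fails once a caller has hit EOF too many times. */
class EofCountingInputStream extends FilterInputStream {
    private final int maxEofReads; // assumed threshold, e.g. 1000
    private int eofReads = 0;

    EofCountingInputStream(InputStream in, int maxEofReads) {
        super(in);
        this.maxEofReads = maxEofReads;
    }

    @Override
    public int read() throws IOException {
        return check(super.read());
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return check(super.read(b, off, len));
    }

    /** Counts every -1 result and throws when the count exceeds the limit. */
    private int check(int result) throws IOException {
        if (result == -1 && ++eofReads > maxEofReads) {
            throw new IOException("read too many EOFs: " + eofReads);
        }
        return result;
    }

    int getEofReads() {
        return eofReads;
    }
}
```

Wrapping the stream that is handed to a parser in a guard like this turns a silent hang into an exception with a stack trace that points at the looping reader.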

      This is one of the triggering files: https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W

      It's a truncated file from Common Crawl.

      The stack trace at the point where the check throws is:

      afterRead:809, TikaInputStream (org.apache.tika.io)
      read:82, ProxyInputStream (org.apache.commons.io.input)
      <init>:113, RandomAccessReadBuffer (org.apache.pdfbox.io)
      loadPDF:454, Loader (org.apache.pdfbox)
      loadPDF:430, Loader (org.apache.pdfbox)
      getPDDocument:189, PDFParser (org.apache.tika.parser.pdf)
      parse:148, PDFParser (org.apache.tika.parser.pdf)
      parse:288, CompositeParser (org.apache.tika.parser)
      parse:288, CompositeParser (org.apache.tika.parser)
      parse:150, AutoDetectParser (org.apache.tika.parser)
      parse:157, RecursiveParserWrapper (org.apache.tika.parser)
      getRecursiveMetadata:379, TikaTest (org.apache.tika)
      getRecursiveMetadata:369, TikaTest (org.apache.tika)
      getRecursiveMetadata:357, TikaTest (org.apache.tika)
      getRecursiveMetadata:351, TikaTest (org.apache.tika)
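The top frames show the repeated read coming from the RandomAccessReadBuffer constructor while it consumes the input stream. As a general illustration only, and not a claim about what the PDFBox code does or where its bug was, this sketch shows a buffer-filling loop that terminates on a truncated stream because it treats -1 as the end of input. `ReadLoops` and `readFully` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;

final class ReadLoops {
    private ReadLoops() {
    }

    /** Fills buf from in, stopping at EOF; returns the number of bytes copied. */
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) {
                // EOF: a truncated stream ends here. A loop that merely
                // retried on n <= 0 would spin forever at this point.
                break;
            }
            total += n;
        }
        return total;
    }
}
```

Any loop whose exit condition depends only on how many bytes were expected, and not on whether the stream has ended, can hang on a truncated file like the one above.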
      
      
      

            People

              Assignee: Tilman Hausherr (tilman)
              Reporter: Tim Allison (tallison)
              Votes: 0
              Watchers: 5
