Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- 3.0.0 PDFBox
- None
- None
Description
I found a bunch of files that triggered a "read too many EOFs" exception. This is a safety check we now do in TikaInputStream to identify parsers that read an EOF more than 1000 times, which may be a sign of an infinite loop.
When I turn off this safety check in TikaInputStream, parsing these files goes into an infinite loop.
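To illustrate the kind of check described above, here is a minimal sketch of a stream wrapper that counts EOF reads and throws once a limit is exceeded. It is not Tika's actual implementation: the class name `EofCountingInputStream`, the `maxEofReads` parameter, and the exact exception message are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical guard stream (not the real TikaInputStream code).
 * Counts how often the wrapped stream reports EOF and fails fast
 * when a caller keeps reading past the end of the data.
 */
class EofCountingInputStream extends FilterInputStream {
    private final int maxEofReads; // assumed limit, e.g. 1000
    private int eofReads = 0;

    EofCountingInputStream(InputStream in, int maxEofReads) {
        super(in);
        this.maxEofReads = maxEofReads;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        afterRead(b);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        afterRead(n);
        return n;
    }

    // Called after every read; -1 means the underlying stream hit EOF.
    private void afterRead(int n) throws IOException {
        if (n == -1 && ++eofReads > maxEofReads) {
            throw new IOException("read too many EOFs: " + eofReads);
        }
    }

    int getEofReads() {
        return eofReads;
    }
}
```

A caller that loops on `read()` without ever checking for `-1` would spin forever on a plain stream. With a wrapper like this, the loop ends with an `IOException` on the first EOF read past the limit.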
This is one of the triggering files: https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W
It's a truncated file from Common Crawl.
The stacktrace when this is thrown is:
afterRead:809, TikaInputStream (org.apache.tika.io)
read:82, ProxyInputStream (org.apache.commons.io.input)
<init>:113, RandomAccessReadBuffer (org.apache.pdfbox.io)
loadPDF:454, Loader (org.apache.pdfbox)
loadPDF:430, Loader (org.apache.pdfbox)
getPDDocument:189, PDFParser (org.apache.tika.parser.pdf)
parse:148, PDFParser (org.apache.tika.parser.pdf)
parse:288, CompositeParser (org.apache.tika.parser)
parse:288, CompositeParser (org.apache.tika.parser)
parse:150, AutoDetectParser (org.apache.tika.parser)
parse:157, RecursiveParserWrapper (org.apache.tika.parser)
getRecursiveMetadata:379, TikaTest (org.apache.tika)
getRecursiveMetadata:369, TikaTest (org.apache.tika)
getRecursiveMetadata:357, TikaTest (org.apache.tika)
getRecursiveMetadata:351, TikaTest (org.apache.tika)
Issue Links
- duplicates PDFBOX-5543 pdfbox 3.0.0-RC1 | Loader.loadPDF(inputStream) is getting stuck for few pdf files (Closed)