Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Description
I found two files in the regression tests that now time out at 3 minutes where they did not before. Unfortunately, PDFBox's export:text works on both, so the cause is probably not plain text extraction in PDFBox; it may be some other structural feature of the files, or perhaps a problem in Tika?
This file halts after printing out the header for Table 19 on page 46: https://corpora.tika.apache.org/base/docs/govdocs1/078/078656.pdf
Pure PDFBox's export:text complains multiple times, "Page skipped due to an invalid or missing type null", but it does finish quickly.
This file halts after extracting "854,793,592":
https://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/G7/G7BO7PNCCREVF2BCY5YSYOPYDLMBYASY
Pure PDFBox's export:text processes this without problem.
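To compare the two tools on the same file, something like the following could help. It is only a minimal sketch in Python, not the regression harness itself: `run_with_timeout` is a hypothetical helper, and the 180-second default just mirrors the 3-minute limit mentioned above. The command to run is passed in by the caller, so no particular Tika or PDFBox command line is assumed.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout_s=180.0):
    """Run cmd (a list of arguments) and report whether it hit the timeout.

    Returns (timed_out, elapsed_seconds). Output is discarded because only
    the timing behaviour is of interest here.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        timed_out = True
    return timed_out, time.monotonic() - start


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the command to run.
    timed_out, elapsed = run_with_timeout(sys.argv[1:])
    status = "TIMED OUT" if timed_out else "finished"
    print(f"{status} after {elapsed:.1f}s")
```

Running it once with the Tika command line and once with the PDFBox export:text command line on each of the two files should show whether only the Tika path reaches the timeout.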