Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- 1.4
- None
- None
Description
We are using Tika on large web archives, which also contain some malformed files. In particular, we found an HTML file with binary characters in its DOCTYPE declaration. This file causes Tika to hang during format detection, whether Tika is embedded or run from the command line.
An example file is attached.
Attachments
Issue Links
- is related to: TIKA-1576 Upgrade metadata-extractor to version 2.7.2 (Resolved)
- relates to: NUTCH-2223 Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection (Closed)