Description
I pulled 249 XPS files out of the latest Common Crawl crawl and compared 3.0.1-SNAPSHOT against 3.0.0. There are some new exceptions, one NPE, and a few NumberFormatExceptions where a comma-delimited string is parsed as if it were an integer.
Reports are attached. See especially new_exceptions_in_b_details.xlsx and content_diffs_no_exceptions.xlsx.
The source files are available here: https://corpora.tika.apache.org/base/share/xps.tgz
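To illustrate the number format failure described above, here is a minimal sketch in Java. It is not taken from the Tika XPS parser: the class name, the helper parseCommaDelimitedInts, and the sample value "12,34" are all made up for illustration, and the actual attribute and call site in the parser may differ. It only shows why Integer.parseInt rejects a comma-delimited string and one possible way to parse each token separately.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical class, not part of Tika.
class CommaDelimitedParseSketch {

    // Hypothetical helper: split on commas and parse each token on its own,
    // skipping empty tokens such as those produced by "1,,2".
    static List<Integer> parseCommaDelimitedInts(String raw) {
        List<Integer> values = new ArrayList<>();
        for (String token : raw.split(",")) {
            String trimmed = token.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            values.add(Integer.parseInt(trimmed));
        }
        return values;
    }

    public static void main(String[] args) {
        String raw = "12,34"; // made-up sample value, not from the corpus

        // Parsing the whole string as a single integer fails.
        try {
            Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException: " + e.getMessage());
        }

        // Parsing token by token succeeds.
        System.out.println(parseCommaDelimitedInts(raw));
    }
}
```

Whether the right fix is to parse every token, take only the first one, or ignore the value depends on what the field means in the XPS markup, which the reports do not settle.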
Attachments
Issue Links
- is related to TIKA-4315 XPS file parser does not emit whitespace as expected (Resolved)