TIKA-4337: Improvements to recent xps mods

Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0.0, 3.1.0
    • Component/s: None
    • Labels: None

    Description

      I pulled 249 xps files out of the latest commoncrawl crawl and compared 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few number format exceptions where a comma-delimited string is parsed as if it were an integer.

      Reports are attached. See esp. new_exceptions_in_b_details.xlsx and content_diffs_no_exceptions.xlsx.

      The source files are available here: https://corpora.tika.apache.org/base/share/xps.tgz
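
      For anyone reproducing the number format failures, here is a minimal sketch of that failure mode and one defensive way to handle it. The class name, the `parseInts` helper, and the sample value `"12,34"` are hypothetical and chosen for illustration only; this is not the code in the XPS parser and not necessarily how the fix was made.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not taken from the Tika code base.
final class CommaDelimitedInts {

    // Splits a comma-delimited value and parses each token on its own.
    // A null input yields an empty list, and tokens that are empty or
    // not valid integers are skipped instead of aborting the parse.
    static List<Integer> parseInts(String raw) {
        List<Integer> values = new ArrayList<>();
        if (raw == null) {
            return values;
        }
        for (String token : raw.split(",")) {
            String trimmed = token.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            try {
                values.add(Integer.parseInt(trimmed));
            } catch (NumberFormatException e) {
                // Assumption: skipping a bad token is acceptable here.
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String raw = "12,34"; // assumed sample value
        try {
            // Parsing the whole string as one integer is the failure mode.
            Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            System.out.println("Integer.parseInt failed: " + e.getMessage());
        }
        System.out.println("Parsed per token: " + parseInts(raw));
    }
}
```

      Whether a malformed token should be skipped, logged, or turned into a parse exception is a design choice that depends on how the value is used downstream.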

      Attachments

        1. xps-reports.tgz
          204 kB
          Tim Allison

            People

              Assignee: Unassigned
              Reporter: Tim Allison (tallison)
              Votes: 0
              Watchers: 5