Tika / TIKA-2750

Update regression corpus

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      I think we've had great success with the current data in our regression corpus. I'd like to refresh some of the data from Common Crawl, with three primary goals:

      1) Include more interesting documents (e.g., downsample English UTF-8 text/html).
      2) Include more recent documents (perhaps newer features in PDFs; definitely more OOXML).
      3) Identify truncated documents and re-fetch them from the original site. Common Crawl truncates documents at 1 MB. I think some truncated documents have been quite useful, much like fuzzing, for identifying serious problems in some of our parsers. However, it would be useful to have more complete files, especially PDFs. In short, we should keep some truncated documents, but I'd also like to get more complete ones.
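
      The goals above could be sketched roughly as follows. This is only an illustration, not the tooling actually used for this issue: the record fields (url, mime, charset, lang, length), the function names, the per-group cap, and the exact 1 MiB threshold are all assumptions made for the example. It caps how many documents are kept per (mime, charset, language) group, which downsamples over-represented groups such as English UTF-8 text/html, and it flags documents whose payload reaches the limit as candidates for re-fetching.

```python
import random
from collections import defaultdict

# Assumption: the issue says Common Crawl truncates documents at 1 MB; the exact
# byte threshold used here (1 MiB) should be checked against real records.
TRUNCATION_LIMIT = 1024 * 1024


def is_probably_truncated(payload_length, limit=TRUNCATION_LIMIT):
    """Heuristic: a payload that reaches the limit was likely cut off."""
    return payload_length >= limit


def downsample(records, key, cap, seed=0):
    """Keep at most `cap` randomly chosen records per group, preserving input order."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[key(record)].append(index)
    kept = set()
    for indices in groups.values():
        if len(indices) > cap:
            indices = rng.sample(indices, cap)
        kept.update(indices)
    return [record for index, record in enumerate(records) if index in kept]


def plan_refresh(records, cap):
    """Return the downsampled records and the URLs worth re-fetching in full."""
    # Hypothetical record layout: dicts with url, mime, charset, lang, length.
    sampled = downsample(
        records,
        key=lambda r: (r["mime"], r["charset"], r["lang"]),
        cap=cap,
    )
    refetch = [r["url"] for r in sampled if is_probably_truncated(r["length"])]
    return sampled, refetch
```

      Keeping the truncated copy in the sample while also queuing its URL for a full re-fetch matches the intent of goal 3: the truncated file stays useful as a fuzzing-like input, and the complete file is added alongside it.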

        Attachments

        1. CC-MAIN-2018-39-mimes-v-detected.zip
          57 kB
          Tim Allison
        2. CC-MAIN-2018-39-mimes-charsets-by-tld.zip
          161 kB
          Tim Allison
        3. CC-MAIN-2018-39-charset_lang_by_tld.zip
          362 kB
          Tim Allison

              People

              • Assignee: Unassigned
              • Reporter: Tim Allison (tallison@apache.org)
              • Votes: 0
              • Watchers: 4
