[TIKA-2750] Update regression corpus - ASF JIRA

Details

Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

I think we've had great success with the current data in our regression corpus. I'd like to refresh some of the data from Common Crawl with three primary goals:

1) include more interesting documents (e.g. downsample English UTF-8 text/html)
2) include more recent documents (perhaps newer features in PDFs? definitely more OOXML)
3) identify truncated documents and re-fetch them from the original site. Common Crawl truncates docs at 1 MB. I think some truncated documents have been quite useful, similar to fuzzing, for identifying serious problems with some of our parsers. However, it would be useful to have more complete files, esp. for PDFs. In short, we should keep some truncated documents, but I'd also like to get more complete docs.
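
A minimal sketch of goals 1 and 3, not the actual selection code used for the corpus. It assumes each crawl record is available as a dict with hypothetical keys `url`, `mime`, `charset`, `lang` and `length`, and that the "1 MB" limit means 1 MiB (1,048,576 bytes); the sampling rate is an illustrative value.

```python
import random

# Assumption: the issue only says "1 MB"; 1 MiB is used here for illustration.
TRUNCATION_LIMIT = 1024 * 1024


def is_probably_truncated(payload_length, limit=TRUNCATION_LIMIT):
    """Flag a record whose stored payload reached the size cap."""
    return payload_length >= limit


def keep_record(mime, charset, lang, rng, common_rate=0.1):
    """Downsample English UTF-8 text/html; keep every other record."""
    is_common = (
        mime == "text/html" and charset.lower() == "utf-8" and lang == "en"
    )
    return (not is_common) or rng.random() < common_rate


def plan(records, rng=None):
    """Return (urls selected for the corpus, selected urls worth re-fetching)."""
    rng = rng or random.Random(0)  # fixed seed keeps the selection repeatable
    selected, refetch = [], []
    for rec in records:
        if not keep_record(rec["mime"], rec["charset"], rec["lang"], rng):
            continue
        selected.append(rec["url"])
        if is_probably_truncated(rec["length"]):
            refetch.append(rec["url"])
    return selected, refetch
```

Keeping a fraction of the `refetch` list in its truncated form as well would preserve the fuzzing-like value described in goal 3.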

Attachments

CC-MAIN-2018-39-charset_lang_by_tld.zip, 05/Nov/18 15:37, 362 kB, Tim Allison
CC-MAIN-2018-39-mimes-charsets-by-tld.zip, 26/Oct/18 15:35, 161 kB, Tim Allison
CC-MAIN-2018-39-mimes-v-detected.zip, 01/Nov/18 18:16, 57 kB, Tim Allison

Issue Links

is related to: TIKA-2038 A more accurate facility for detecting Charset Encoding of HTML documents (Open)

People

Assignee: Unassigned
Reporter: Tim Allison
Votes: 0
Watchers: 4

Dates

Created: 05/Oct/18 13:24
Updated: 30/Jul/19 20:26
Resolved: 30/Jul/19 20:26