[TIKA-2940] Consider an ensemble charset detection method - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

I recently ran our four charset detectors against our text-based files.

The raw data is available here:
http://162.242.228.174/encoding_detection/charsets_combined_201909.sql.zip (as SQL) or http://162.242.228.174/encoding_detection/charsets_combined_201909.csv.zip (as CSV).

I've posted a preliminary/draft report here: https://github.com/tballison/share/blob/master/slides/Tika_charset_detector_study_201909.docx

Overall, an ensemble approach could yield a ~1.4% improvement in "common tokens"[0] on our corpus. For users with more homogeneous documents, the improvement could be far greater (e.g., if their documents all come from a content management system that applies an incorrect HTML meta charset header).

I'm opening this issue for discussion and as encouragement for others to work with the raw data and/or make recommendations on the preliminary report's methodology.

[0] "common tokens" in tika-eval refers to the lists we developed of the 30k most common words for each of the 118 languages that tika-eval covers. An increase in the total number of "common tokens" can be a sign of improved extraction.
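
To make the idea concrete, here is a minimal sketch of one way an ensemble could use "common tokens" to arbitrate between candidate charsets: decode the bytes with each candidate and keep the one whose text contains the most common tokens. This is not Tika's or tika-eval's actual code. The names `COMMON_TOKENS`, `count_common_tokens`, and `pick_charset` are hypothetical, the tiny word set stands in for the real per-language lists, and the candidate list is assumed to come from the individual detectors.

```python
import re

# Hypothetical stand-in for a tika-eval common-token list (the real lists
# hold roughly 30k words per language); a handful of French words here.
COMMON_TOKENS = {"le", "café", "était", "déjà"}


def count_common_tokens(text, common=COMMON_TOKENS):
    """Count how many word tokens in the text appear in the common-token set."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if token in common)


def pick_charset(raw, candidates, common=COMMON_TOKENS):
    """Return the candidate charset whose decoding yields the most common tokens.

    Candidates that cannot decode the bytes, or that Python does not know,
    are skipped. Ties go to the earlier candidate. Returns None if no
    candidate decodes the bytes.
    """
    best_charset = None
    best_score = -1
    for charset in candidates:
        try:
            text = raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
        score = count_common_tokens(text, common)
        if score > best_score:
            best_charset, best_score = charset, score
    return best_charset


if __name__ == "__main__":
    # UTF-8 bytes that a misconfigured header might label as windows-1252.
    raw = "le café était déjà fermé".encode("utf-8")
    print(pick_charset(raw, ["windows-1252", "utf-8"]))
```

In this sketch the order of `candidates` only matters for ties, so a caller could list the detector it trusts most first. A real ensemble would also need to handle language identification and very short inputs, which this example ignores.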

Issue Links

is related to

TIKA-2933 Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector (Open)

TIKA-2038 A more accurate facility for detecting Charset Encoding of HTML documents (Open)

TIKA-2936 The stricter StandardHtmlDetector extracts some header charsets where our legacy detector doesn't (Open)

TIKA-2937 Improve legacy HTML charset detector by replicating Standard's behavior for UTF-16 (Open)

People

Assignee: Unassigned

Reporter: Tim Allison

Votes: 0

Watchers: 2

Dates

Created: 09/Sep/19 11:03

Updated: 09/Sep/19 11:04