[TIKA-2933] Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector. - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

I'm finally getting around to running the comparisons between our legacy HtmlEncodingDetector and the newer StandardHtmlEncodingDetector. More analysis is required, but the newer one is generally better*. One area for improvement or explanation, though, is the "replacement" encoding.

* There are 1 million more "common words" in text extracted from files with the StandardHtmlEncodingDetector than with only our legacy detector. There are 133M common words in our legacy extracts, so that is an improvement of less than 1% (roughly 0.75%).
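For context on the "replacement" encoding mentioned above: the WHATWG Encoding Standard maps a small set of legacy labels to a single encoding named "replacement", whose decoder emits U+FFFD instead of decoding the bytes. The following is a minimal sketch of that label lookup only. It is not Tika's implementation; the class name `ReplacementLabelSketch` and the method `isReplacementLabel` are hypothetical, and `trim()` is used as an approximation of the standard's ASCII-whitespace stripping.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; not part of Apache Tika.
public class ReplacementLabelSketch {

    // Labels that the WHATWG Encoding Standard assigns to the "replacement" encoding.
    private static final Set<String> REPLACEMENT_LABELS = Set.of(
            "csiso2022kr",
            "hz-gb-2312",
            "iso-2022-cn",
            "iso-2022-cn-ext",
            "iso-2022-kr",
            "replacement");

    // Returns true if the declared charset label resolves to the "replacement" encoding.
    static boolean isReplacementLabel(String label) {
        if (label == null) {
            return false;
        }
        // Approximates the standard's label normalization:
        // strip surrounding whitespace, then compare case-insensitively.
        String normalized = label.trim().toLowerCase(Locale.ROOT);
        return REPLACEMENT_LABELS.contains(normalized);
    }

    public static void main(String[] args) {
        String[] declared = {"ISO-2022-KR", " hz-gb-2312 ", "utf-8"};
        for (String label : declared) {
            String verdict = isReplacementLabel(label) ? "replacement" : "not replacement";
            System.out.println(label.trim() + " -> " + verdict);
        }
    }
}
```

A detector that follows the standard strictly would report "replacement" for such pages, which is presumably why these mappings deserve a second look when the goal is text extraction rather than browser-style rendering.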

Issue Links

is related to

TIKA-2673 HtmlEncodingDetector doesn't follow the specification (Resolved)

relates to

TIKA-2937 Improve legacy HTML charset detector by replicating Standard's behavior for UTF-16 (Open)

TIKA-2940 Consider an ensemble charset detection method (Open)

People

Assignee: Unassigned

Reporter: Tim Allison

Votes: 0

Watchers: 2

Dates

Created: 30/Aug/19 12:08

Updated: 09/Sep/19 11:04