Tika / TIKA-2038

A more accurate facility for detecting Charset Encoding of HTML documents



    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core, detector
    • Labels: None


      Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as other plain-text documents. However, the accuracy of encoding detectors, including icu4j, is noticeably lower on HTML documents than on other text documents. Hence, for our project I developed a library that works quite well on HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
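      One reason HTML is a special case is that HTML documents often declare their own encoding in a meta tag, which a purely byte-statistics detector does not take into account. As a minimal illustrative sketch (the class and method names below are hypothetical, not part of Tika or IUST-HTMLCharDet), a markup-aware detector might first sniff a declared charset from the document head before falling back to statistical detection:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {
    // Matches both <meta charset="..."> and the older
    // <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]+charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if none is found
    // in the first bytes of the document.
    static String declaredCharset(byte[] htmlBytes) {
        // Decode the head as ISO-8859-1 so every byte maps to exactly one char;
        // this is safe for scanning ASCII-compatible markup like <meta ...>.
        int headLen = Math.min(htmlBytes.length, 1024);
        String head = new String(htmlBytes, 0, headLen, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] doc = "<html><head><meta charset=\"windows-1251\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredCharset(doc));
    }
}
```

      A real hybrid detector would of course still need a statistical fallback (e.g. icu4j) for documents with no declaration or a wrong one; the sketch only shows the markup-aware first step that generic text detectors skip.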

      Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr, etc., and since these projects deal heavily with HTML documents, having such a facility in Tika would help them become more accurate as well.


        Attachments

        1. comparisons_20160803b.xlsx (104 kB, Tim Allison)
        2. comparisons_20160804.xlsx (115 kB, Tim Allison)
        3. iust_encodings.zip (7 kB, Tim Allison)
        4. lang-wise-eval_results.zip (2.51 MB, Shabanali Faghani)
        5. lang-wise-eval_runnable.zip (19.23 MB, Shabanali Faghani)
        6. lang-wise-eval_source_code.zip (9.53 MB, Shabanali Faghani)
        7. proposedTLDSampling.csv (1 kB, Tim Allison)
        8. tika_1_14-SNAPSHOT_encoding_detector.zip (5 kB, Tim Allison)
        9. tld_text_html_plus_H_column.xlsx (14 kB, Shabanali Faghani)
        10. tld_text_html.xlsx (13 kB, Tim Allison)

              Assignee: Unassigned
              Reporter: Shabanali Faghani (faghani)