Tika / TIKA-2038

A more accurate facility for detecting Charset Encoding of HTML documents



    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core, detector
    • Labels: None


      Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as other plain-text documents. However, the accuracy of encoding detectors, including icu4j, is noticeably lower on HTML documents than on other text documents. Hence, for our project I developed a library that works quite well on HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
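      One reason HTML is a special case is that HTML documents often declare their own encoding in a meta tag, which a purely byte-statistics detector does not take into account. As a minimal illustrative sketch (the class and method names below are hypothetical, not part of Tika or IUST-HTMLCharDet), a markup-aware detector might first sniff a declared charset from the document head before falling back to statistical detection:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {
    // Matches both <meta charset="..."> and the older
    // <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]+charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if none is found
    // in the first bytes of the document.
    static String declaredCharset(byte[] htmlBytes) {
        // Decode the head as ISO-8859-1 so every byte maps to exactly one char;
        // this is safe for scanning ASCII-compatible markup like <meta ...>.
        int headLen = Math.min(htmlBytes.length, 1024);
        String head = new String(htmlBytes, 0, headLen, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] doc = "<html><head><meta charset=\"windows-1251\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredCharset(doc));
    }
}
```

      A real hybrid detector would of course still need a statistical fallback (e.g. icu4j) for documents with no declaration or a wrong one; the sketch only shows the markup-aware first step that generic text detectors skip.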

      Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr, etc., and since these projects deal heavily with HTML documents, having such a facility in Tika would help them become more accurate as well.


        Attachments

        1. comparisons_20160803b.xlsx (104 kB, Tim Allison)
        2. comparisons_20160804.xlsx (115 kB, Tim Allison)
        3. iust_encodings.zip (7 kB, Tim Allison)
        4. lang-wise-eval_results.zip (2.51 MB, Shabanali Faghani)
        5. lang-wise-eval_runnable.zip (19.23 MB, Shabanali Faghani)
        6. lang-wise-eval_source_code.zip (9.53 MB, Shabanali Faghani)
        7. proposedTLDSampling.csv (1 kB, Tim Allison)
        8. tika_1_14-SNAPSHOT_encoding_detector.zip (5 kB, Tim Allison)
        9. tld_text_html_plus_H_column.xlsx (14 kB, Shabanali Faghani)
        10. tld_text_html.xlsx (13 kB, Tim Allison)

              Assignee: Unassigned
              Reporter: Shabanali Faghani (faghani)