Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Won't Fix
Description
While working on TIKA-2038 with tallison@mitre.org, I found that the HTMLEncodingDetector class cannot extract the charset from some HTML documents. I've attached the HTML documents on which HTMLEncodingDetector fails. It seems that its regex should be corrected to cover these cases.
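For illustration only, here is a minimal sketch of the kind of tolerance a meta-charset regex typically needs: case-insensitive matching, optional whitespace around `=`, and optional single or double quotes. This is not Tika's actual pattern, and the inputs below are not the attached documents; the class name `MetaCharsetSketch` and the method `findCharsetName` are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; NOT the regex used by Tika's HTMLEncodingDetector.
final class MetaCharsetSketch {

    // Looks for "charset = value" inside a <meta ...> tag.
    // Assumptions: tag and attribute names may be in any case, whitespace may
    // surround '=', and the value may be unquoted or quoted with ' or ".
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?\\s*([A-Za-z0-9_.:\\-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, if any, without validating it.
    static Optional<String> findCharsetName(String htmlPrefix) {
        Matcher m = META_CHARSET.matcher(htmlPrefix);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<meta charset=\"utf-8\">",
            "<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset = 'windows-1252'\">",
            "<meta http-equiv=Content-Type content=text/html;charset=ISO-8859-1>",
            "<html><head><title>no declaration</title></head></html>"
        };
        for (String s : samples) {
            System.out.println(findCharsetName(s).orElse("(none)"));
        }
    }
}
```

A sketch like this still accepts false positives (for example, the text `charset=` inside an unrelated `content` attribute), so it only shows the matching variations and is not a proposed fix.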
Attachments
Issue Links
- Is contained by: TIKA-2038 A more accurate facility for detecting Charset Encoding of HTML documents (Open)