[TIKA-2050] HTMLEncodingDetector class fails on some HTML documents - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: parser
Labels: None

Description

While tallison@mitre.org and I were working on TIKA-2038, I found that the HTMLEncodingDetector class cannot extract the charset from some HTML documents. I've attached the HTML documents on which HTMLEncodingDetector fails. It seems that its regex should be corrected to cover these cases.
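
To illustrate the kind of false negative described above, here is a minimal sketch of how a narrowly written meta-tag regex can miss a charset declaration that a more tolerant one catches. Everything in it is an assumption made for illustration: the class name `MetaCharsetSketch`, both patterns, and the sample inputs are hypothetical. They are not Tika's actual regex, and they are not taken from the attached documents.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Tika code: compares a strict and a tolerant
// regex for pulling a charset name out of an HTML <meta> tag.
public class MetaCharsetSketch {

    // Assumed strict pattern: accepts only the classic http-equiv form,
    // with double quotes and a fixed attribute order.
    static final Pattern STRICT = Pattern.compile(
            "<meta\\s+http-equiv=\"Content-Type\"\\s+content=\"[^\"]*charset=([\\w-]+)\"",
            Pattern.CASE_INSENSITIVE);

    // Assumed tolerant pattern: any <meta> tag that contains "charset=",
    // allowing whitespace around '=' and an optional single or double quote.
    static final Pattern TOLERANT = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?\\s*([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the first captured charset name, or empty if nothing matches.
    static Optional<String> find(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up inputs: a classic declaration and an HTML5-style one.
        String classic = "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=windows-1256\">";
        String html5 = "<html><head><meta charset='utf-8'></head></html>";

        System.out.println("strict/classic:   " + find(STRICT, classic));
        System.out.println("strict/html5:     " + find(STRICT, html5));
        System.out.println("tolerant/classic: " + find(TOLERANT, classic));
        System.out.println("tolerant/html5:   " + find(TOLERANT, html5));
    }
}
```

In this sketch the strict pattern finds the charset only in the classic declaration, while the tolerant pattern finds it in both. Whether the attached documents fail for this reason or for a different one is not stated in the report.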

Attachments

false-negative-responce-from-HTMLEncodingDetector.zip
05/Aug/16 21:21
146 kB
Shabanali Faghani

Issue Links

Is contained by

TIKA-2038 A more accurate facility for detecting Charset Encoding of HTML documents

Open

People

Assignee: Unassigned

Reporter: Shabanali Faghani

Votes: 0

Watchers: 3

Dates

Created: 05/Aug/16 21:20

Updated: 11/Aug/16 12:33

Resolved: 11/Aug/16 12:33