[TIKA-2671] HtmlEncodingDetector doesn't take provided metadata into account - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: detector
Labels: None

Description

org.apache.tika.parser.html.HtmlEncodingDetector ignores the document's metadata. As a result, when an HTML document arrives with a charset declared at the transport layer (for example, in an HTTP Content-Type header) that conflicts with the charset declared inside the file, the detector returns the encoding declared inside the file.

This behavior does not conform to what is specified by the W3C for determining the character encoding of HTML pages. This causes bugs similar to NUTCH-2599.
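
To make the expected precedence concrete, here is a minimal sketch that uses only the JDK. The class name DeclaredCharsetPrecedence and both of its methods are hypothetical and are not part of Tika. It assumes the transport-layer charset arrives as the charset parameter of an HTTP Content-Type header and that, as this issue argues, it should win over the charset declared inside the document.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of Tika: illustrates the precedence this issue expects.
final class DeclaredCharsetPrecedence {

    /** Returns the charset named in a Content-Type header value, if one is present and supported. */
    static Optional<Charset> fromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name: treat it as not declared.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    /** The transport-layer charset wins; the in-document declaration is only a fallback. */
    static Charset effective(String contentTypeHeader, Charset declaredInDocument) {
        return fromContentType(contentTypeHeader).orElse(declaredInDocument);
    }
}
```

With a header of "text/html; charset=UTF-8" and an in-document declaration of ISO-8859-1, effective returns UTF-8. The behavior reported here corresponds to returning ISO-8859-1 in that situation.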

If HtmlEncodingDetector is not meant to take into account meta-information about the document, then maybe another detector should be provided, that would be a CompositeDetector including, in that order:

1. a new, simple MetadataEncodingDetector that would simply return the encoding given in the metadata, if any
2. the existing HtmlEncodingDetector
3. a generic detector, like UniversalEncodingDetector
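
The proposed chain could look roughly like the sketch below. It does not use Tika's classes: the Detector interface, the three detector constants, the regex-based meta scan, and the choice of the "Content-Encoding" metadata key and a UTF-8 fallback are all assumptions made for illustration. In a real implementation the second and third stages would be HtmlEncodingDetector and a generic detector such as UniversalEncodingDetector.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed detector chain; none of these types are Tika's.
final class CompositeCharsetSketch {

    /** Stand-in for an encoding detector: returns null when it has no answer. */
    interface Detector {
        Charset detect(byte[] content, Map<String, String> metadata);
    }

    /** Resolves a charset name, returning null for missing, malformed, or unsupported names. */
    static Charset lookup(String name) {
        try {
            return name == null ? null : Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Stage 1: trust the metadata (the key name is an assumption for this sketch).
    static final Detector METADATA =
            (content, metadata) -> lookup(metadata.get("Content-Encoding"));

    private static final Pattern META_CHARSET =
            Pattern.compile("(?i)<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)");

    // Stage 2: crude stand-in for an in-document <meta> scan over the first 1024 bytes.
    static final Detector HTML_META = (content, metadata) -> {
        String head = new String(content, 0, Math.min(content.length, 1024),
                StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? lookup(m.group(1)) : null;
    };

    // Stage 3: stand-in for a generic statistical detector; always answers UTF-8 here.
    static final Detector FALLBACK = (content, metadata) -> StandardCharsets.UTF_8;

    /** Runs the detectors in order and returns the first non-null answer. */
    static Charset detect(byte[] content, Map<String, String> metadata) {
        for (Detector d : Arrays.asList(METADATA, HTML_META, FALLBACK)) {
            Charset c = d.detect(content, metadata);
            if (c != null) {
                return c;
            }
        }
        return null;
    }
}
```

Because the metadata stage runs first, a charset supplied with the document overrides a conflicting meta declaration, and the later stages are consulted only when the metadata carries no usable charset.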

Issue Links

is depended upon by

NUTCH-2599 charset detection issue with parse-tika

Reopened

People

Assignee: Unassigned

Reporter: Gerard Bouchar

Votes: 0

Watchers: 4

Dates

Created: 15/Jun/18 15:45

Updated: 27/Jun/18 13:04