[TIKA-332] Use http-equiv meta tag charset info when processing HTML documents - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.5
Fix Version/s: 0.6
Component/s: None
Labels: None

Description

Currently Tika doesn't use the charset info that's optionally present in HTML documents, via the <meta http-equiv="Content-type" content="text/html; charset=xxx"> tag.

If the MIME type is detected as one that's handled by the HtmlParser, then the first 4-8K of bytes should be decoded as US-ASCII and scanned using a regex something like:

private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile("<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type\\s*['\"]\\s+content\\s*=\\s*['\"][^;'\"]+;\\s*charset\\s*=\\s*([^'\"\\s]+)\\s*['\"]", Pattern.CASE_INSENSITIVE);

If a charset is detected, it should take precedence over a charset in the HTTP response headers, and (obviously) should be used to convert the bytes to text before the actual parsing of the document begins.
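The steps above could be sketched roughly as follows. This is only an illustration, not the attached patch: the class name MetaCharsetSniffer, the methods sniff and resolve, and the 8192-byte prefix size are all hypothetical, and the regex is a variant of the one above that allows multi-character values and either quote style.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Tika's API.
final class MetaCharsetSniffer {
    // Assumed prefix size; the description suggests scanning the first 4-8K.
    private static final int PREFIX_LENGTH = 8192;

    private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(
        "<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type\\s*['\"]\\s+"
        + "content\\s*=\\s*['\"][^;'\"]+;\\s*charset\\s*=\\s*([^'\"\\s]+)\\s*['\"]",
        Pattern.CASE_INSENSITIVE);

    // Returns the charset name declared in the meta tag, or null if none is found.
    static String sniff(byte[] content) {
        int length = Math.min(content.length, PREFIX_LENGTH);
        // Decode only the prefix as US-ASCII; the tag itself is plain ASCII.
        String prefix = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher matcher = HTTP_EQUIV_CHARSET_PATTERN.matcher(prefix);
        return matcher.find() ? matcher.group(1) : null;
    }

    // The meta tag charset wins over the fallback (for example, the charset
    // from the HTTP response header) when the JVM supports it.
    static Charset resolve(byte[] content, Charset fallback) {
        String name = sniff(content);
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Syntactically illegal charset name in the page; use the fallback.
        }
        return fallback;
    }
}
```

A caller would pass the raw bytes plus whatever charset the HTTP response header declared, then decode the document with the returned Charset before parsing.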

In a test I did of 100 random HTML pages, roughly 15% contained charset info in the meta tag that wound up being different from the detected or HTTP response header charset, so this is a pretty important improvement to make. Without it, Tika isn't that useful for processing HTML pages.

The other problem, though, is that the HtmlParser code doesn't use the CharsetDetector, which is another reason for lots of incorrectly decoded text. I'll file a separate issue about that.

Attachments

TIKA-332.patch (03/Dec/09 05:18, 4 kB, Kenneth William Krugler)
TIKA-332-2.patch (03/Dec/09 05:26, 2 kB, Kenneth William Krugler)

Issue Links

relates to: TIKA-333 Improve accuracy of charset detection for HTML pages (Closed)

People

Assignee: Jukka Zitting

Reporter: Kenneth William Krugler

Votes: 0

Watchers: 1

Dates

Created: 25/Nov/09 17:49

Updated: 13/Dec/09 00:25

Resolved: 13/Dec/09 00:25