[TIKA-333] Improve accuracy of charset detection for HTML pages - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Not A Problem
Affects Version/s: 0.5
Fix Version/s: None
Component/s: None
Labels: None

Description

Charset detection for HTML pages currently works poorly, because much of the text at the beginning of a document is HTML markup rather than content.

A simple solution would be to skip the first 2K of the document (assuming it is long enough) before passing bytes to ICU4J.
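
A minimal sketch of the skip-ahead idea, using only the JDK. The class name `MarkupSkipper`, the `minRemaining` parameter, and the rule used for "long enough" are assumptions made for illustration, not Tika code. The returned array is what would then be handed to ICU4J, for example through `CharsetDetector.setText(byte[])`.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika: selects the bytes to sample for detection.
final class MarkupSkipper {
    // "First 2K" from the description, interpreted here as 2048 bytes.
    static final int SKIP_BYTES = 2048;

    // Returns the document without its first SKIP_BYTES bytes, but only when at
    // least minRemaining bytes are left afterwards (assumed meaning of
    // "long enough"). Otherwise the document is returned unchanged.
    static byte[] detectionSample(byte[] document, int minRemaining) {
        if (document.length - SKIP_BYTES < minRemaining) {
            return document;
        }
        return Arrays.copyOfRange(document, SKIP_BYTES, document.length);
    }
}
```

Note that a fixed byte offset can land in the middle of a multi-byte character, so the first few bytes of the sample may not decode cleanly.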

A more complex solution would be to scan for the title and body tags, and pass only the bytes found within each.
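
The tag-scanning idea could look roughly like the sketch below. It assumes the page uses an ASCII-compatible encoding, so that the tag names appear as plain ASCII bytes (this does not hold for UTF-16). The bytes are viewed as ISO-8859-1 only to locate the tags, since that mapping is one byte per char and match offsets are therefore byte offsets. The class name `TitleBodyExtractor` and the fallback behaviour are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: collects the raw bytes inside <title> and <body>.
final class TitleBodyExtractor {
    private static final Pattern TITLE = Pattern.compile(
            "<title(?:\\s[^>]*)?>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // A missing </body> is tolerated: the body then runs to the end of the input.
    private static final Pattern BODY = Pattern.compile(
            "<body(?:\\s[^>]*)?>(.*?)(?:</body\\s*>|\\z)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static byte[] detectionSample(byte[] document) {
        // One char per byte, so regex offsets index directly into the byte array.
        String latin1 = new String(document, StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Pattern pattern : new Pattern[] {TITLE, BODY}) {
            Matcher m = pattern.matcher(latin1);
            if (m.find()) {
                out.write(document, m.start(1), m.end(1) - m.start(1));
                out.write(' '); // separator between the two regions
            }
        }
        // Assumed fallback: if neither tag is found, sample the whole document.
        return out.size() == 0 ? document : out.toByteArray();
    }
}
```

Markup nested inside the body is still included in the sample; stripping it would be a further refinement beyond what the description proposes.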

Issue Links

is related to

TIKA-332 Use http-equiv meta tag charset info when processing HTML documents (Resolved)

TIKA-322 Improve encoding detection speed and accuracy (Resolved)

People

Assignee: Unassigned

Reporter: Kenneth William Krugler

Votes: 0

Watchers: 0

Dates

Created: 25/Nov/09 17:52

Updated: 25/Nov/09 18:36

Resolved: 25/Nov/09 18:36