[TIKA-322] Improve encoding detection speed and accuracy - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.2
Component/s: mime
Labels: None

Description

The encoding detection code we took from ICU4J is not very efficient and sometimes produces odd results when more than one encoding matches the given input data. It would be good to refactor the code to be faster for easy-to-detect encodings and to have better heuristics in case multiple matches are found.
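As a rough illustration of the two ideas above (a cheap fast path for easy-to-detect encodings, and an explicit tie-break when several encodings match), here is a minimal sketch using only the JDK. It is not the ICU4J-derived code or the change that resolved this issue. The class `QuickCharsetSketch`, the methods `quickDetect` and `pickBest`, the `Match` type, and the preference list are all hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class QuickCharsetSketch {

    /** Hypothetical result type: a candidate encoding name and a confidence score. */
    static final class Match {
        final String name;
        final int confidence;
        Match(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    /** Returns a charset when the answer is cheap and unambiguous, else null. */
    static Charset quickDetect(byte[] data) {
        // Fast path 1: a byte order mark (UTF-32 BOMs are ignored in this sketch).
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // Fast path 2: no byte has the high bit set, so the input is plain ASCII.
        boolean ascii = true;
        for (byte b : data) {
            if (b < 0) { ascii = false; break; }
        }
        if (ascii) return StandardCharsets.US_ASCII;

        // Fast path 3: non-ASCII input that decodes as strict UTF-8.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return null; // caller falls back to slower statistical detection
        }
    }

    /** Highest confidence wins; equal scores are broken by position in 'preferred'. */
    static Match pickBest(List<Match> matches, List<String> preferred) {
        Match best = null;
        for (Match m : matches) {
            if (best == null
                    || m.confidence > best.confidence
                    || (m.confidence == best.confidence
                        && rank(m.name, preferred) < rank(best.name, preferred))) {
                best = m;
            }
        }
        return best;
    }

    private static int rank(String name, List<String> preferred) {
        int i = preferred.indexOf(name);
        return i < 0 ? Integer.MAX_VALUE : i;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

In this sketch a `null` from `quickDetect` means "not obvious", so the caller would run the slower statistical detectors and pass their results to `pickBest`. Making the tie-break explicit keeps the choice deterministic when two encodings score the same.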

Issue Links

is related to: TIKA-369 Improve accuracy of language detection (Resolved)

relates to: TIKA-333 Improve accuracy of charset detection for HTML pages (Closed)

People

Assignee: Jukka Zitting

Reporter: Jukka Zitting

Votes: 1

Watchers: 1

Dates

Created: 13/Nov/09 04:10

Updated: 07/Jul/12 19:45

Resolved: 07/Jul/12 19:45