[TIKA-341] Use charset in CONTENT_TYPE metadata when detecting the character encoding - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.6
Fix Version/s: 0.6
Component/s: None
Labels: None

Description

If no content encoding is specified, and (for HTML pages) there's no explicit charset in the meta http-equiv tag, then the charset in the content-type metadata should be used as the "declared encoding" for the CharsetDetector.

Relatedly, the CharsetDetector should have input filtering turned on for HTML pages, so that tags are stripped out before detection.
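The fallback order described above can be sketched in plain Java. This is only an illustration of the proposed precedence, not the code in the attached patch: the class `DeclaredEncodingSketch` and its methods are hypothetical names, and the regular expression is a simplified reading of the `charset` parameter in a Content-Type value. The result would be the hint handed to the detector as its declared encoding.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: picks the "declared encoding" hint.
final class DeclaredEncodingSketch {

    // Simplified match for a charset parameter, e.g. text/html; charset="UTF-8"
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("(?i);\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?");

    // Extracts the charset from a Content-Type value, if present and supported
    // by this JVM (an assumption of this sketch; unsupported names are dropped).
    static Optional<String> charsetFromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return Optional.empty();
        }
        String name = m.group(1);
        try {
            return Charset.isSupported(name) ? Optional.of(name) : Optional.empty();
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return Optional.empty();
        }
    }

    // Precedence from the description: explicit content encoding first,
    // then the HTML meta http-equiv charset, then the Content-Type charset.
    static Optional<String> declaredEncoding(String contentEncoding,
                                             String metaCharset,
                                             String contentType) {
        if (contentEncoding != null && !contentEncoding.isEmpty()) {
            return Optional.of(contentEncoding);
        }
        if (metaCharset != null && !metaCharset.isEmpty()) {
            return Optional.of(metaCharset);
        }
        return charsetFromContentType(contentType);
    }
}
```

With an ICU-style CharsetDetector, the returned value would presumably be passed to `setDeclaredEncoding`, and `enableInputFilter(true)` would be called for HTML input so that markup is stripped before detection; both calls are left out here to keep the sketch free of external dependencies.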

Attachments

TIKA-341.patch
03/Dec/09 05:58
8 kB
Kenneth William Krugler

Issue Links

relates to

TIKA-2047 TXTParser overwrites mime type/masks types that are subtype of text

Resolved

People

Assignee: Jukka Zitting

Reporter: Kenneth William Krugler

Votes: 0

Watchers: 1

Dates

Created: 03/Dec/09 05:41

Updated: 05/Aug/16 12:32

Resolved: 13/Dec/09 01:09