Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Fixed
-
0.6
-
None
-
None
Description
If no content encoding is specified, and (for HTML pages) there's no explicit charset in the meta http-equiv tag, then the charset in the content-type metadata should be used as the "declared encoding" for the CharsetDetector.
Related to this is that the CharsetDetector should have filtering turned on for HTML pages, so that tags get stripped out.
Attachments
Attachments
Issue Links
- relates to
-
TIKA-2047 TXTParser overwrites mime type/masks types that are subtype of text
- Resolved