Details
- Type: Bug
- Status: Closed
- Priority: Trivial
- Resolution: Won't Fix
- Affects Version/s: 1.6
Description
In a handful of files from govdocs1, there are some creative http-equiv content-type headers, including:
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" name="keywords" content="DNRC, division of nutrition">
The content type that is going into the metadata for this file is "DNRC, division of nutrition".
Let's modify our HTML meta-header charset detector to pick the first parseable charset value.
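As a rough illustration of the "first parseable charset" idea, here is a minimal sketch. It is not Tika's actual detector: the class name `MetaCharsetSketch`, the method `firstParseableCharset`, and the regex-based attribute scan are all assumptions made for this example, and a real implementation would work on parsed HTML rather than on a raw tag string.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of Tika.
public class MetaCharsetSketch {
    // Finds each double-quoted content="..." attribute value in a meta tag.
    private static final Pattern CONTENT_ATTR =
            Pattern.compile("content\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    // Finds a charset=... parameter inside a content value.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the canonical name of the first charset in the tag that the JVM
     * can resolve, or null if no content attribute carries a usable charset.
     */
    public static String firstParseableCharset(String metaTag) {
        Matcher content = CONTENT_ATTR.matcher(metaTag);
        while (content.find()) {
            Matcher charset = CHARSET_PARAM.matcher(content.group(1));
            if (!charset.find()) {
                continue; // e.g. content="DNRC, division of nutrition"
            }
            try {
                return Charset.forName(charset.group(1)).name();
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: try the next content attribute.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String tag = "<meta http-equiv=\"content-type\" "
                + "content=\"text/html; charset=iso-8859-1\" "
                + "name=\"keywords\" content=\"DNRC, division of nutrition\">";
        System.out.println(firstParseableCharset(tag));
    }
}
```

With this approach the second `content` attribute in the tag above is ignored, because the first one already yields a charset the JVM recognizes; a `content` value with no `charset` parameter is simply skipped.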
Issue Links
- is superseded by TIKA-1519 Don't allow whatever is in http-equiv Content-Type to overwrite actual Content-Type in HtmlParser (Resolved)