[TIKA-1514] http-equiv content-type extraction should pick first parseable content value - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Trivial
Resolution: Won't Fix
Affects Version/s: 1.6
Fix Version/s: 1.8
Component/s: None
Labels: None

Description

In a handful of files from govdocs1, there are some creative http-equiv content-type headers, including:

<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" name="keywords" content="DNRC, division of nutrition">

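The content type recorded in the metadata for this file is "DNRC, division of nutrition", the value of the second content attribute, instead of "text/html; charset=iso-8859-1".

Let's modify our HTML meta-header charset detector to pick the first parseable value: when a meta tag carries several content attributes, use the first one that parses as a content type rather than letting a later attribute overwrite it.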
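
As a rough illustration of "pick the first parseable value", here is a minimal, self-contained sketch. It is not Tika's actual code: the class `MetaContentPicker`, its methods, and both regular expressions are hypothetical, and the loose `type/subtype` pattern only stands in for whatever media type parsing the real detector would use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: scans one <meta ...> tag and
// returns the first content attribute that looks like a media type.
final class MetaContentPicker {

    // Matches content="..." or content='...' (assumption: quoted values only).
    private static final Pattern CONTENT_ATTR = Pattern.compile(
            "\\bcontent\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
            Pattern.CASE_INSENSITIVE);

    // Loose stand-in for real media type parsing: "type/subtype",
    // optionally followed by ";" and parameters such as charset.
    private static final Pattern MEDIA_TYPE = Pattern.compile(
            "^\\s*[\\w.+-]+/[\\w.+-]+\\s*(;.*)?$");

    // Collects every content attribute value in document order.
    static List<String> contentValues(String metaTag) {
        List<String> values = new ArrayList<>();
        Matcher m = CONTENT_ATTR.matcher(metaTag);
        while (m.find()) {
            values.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return values;
    }

    // Returns the first value that passes the media type check, if any.
    static Optional<String> firstParseable(String metaTag) {
        for (String value : contentValues(metaTag)) {
            if (MEDIA_TYPE.matcher(value).matches()) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String tag = "<meta http-equiv=\"content-type\" "
                + "content=\"text/html; charset=iso-8859-1\" "
                + "name=\"keywords\" content=\"DNRC, division of nutrition\">";
        System.out.println(firstParseable(tag).orElse("(none)"));
    }
}
```

With the tag from the description, the keyword list is skipped because it contains no `type/subtype` token, so the first content attribute is the one selected. A tag with no parseable value yields an empty `Optional`, which leaves the caller free to keep whatever content type it had already detected.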

Issue Links

is superseded by

TIKA-1519 Don't allow whatever is in http-equiv Content-Type to overwrite actual Content-Type in HtmlParser (Resolved)

People

Assignee: Unassigned

Reporter: Tim Allison

Votes: 0

Watchers: 1

Dates

Created: 13/Jan/15 16:10

Updated: 15/Jan/15 18:04

Resolved: 15/Jan/15 18:04