[TIKA-539] Encoding detection is too biased by encoding in meta tag - ASF JIRA

Details

Type: Improvement
Status: Reopened
Priority: Minor
Resolution: Unresolved
Affects Version/s: 0.8, 0.9, 0.10
Fix Version/s: 1.17, 2.0.0-BETA, 2.1.0
Component/s: metadata, parser
Labels: None

Description

If the encoding declared in the meta tag is wrong, that encoding is still the one detected,
even if the correct encoding was set in the metadata beforehand (for example, from the HTTP response header).

Test code to reproduce:

static String content = "<html><head>\n"
        + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
        + "</head><body>Über den Wolken\n</body></html>";

/**
 * @param args
 * @throws IOException
 * @throws TikaException
 * @throws SAXException
 */
public static void main(String[] args) throws IOException, SAXException,
        TikaException {
    Metadata metadata = new Metadata();
    metadata.set(Metadata.CONTENT_TYPE, "text/html");
    // The correct encoding, as it could arrive in an HTTP response header.
    metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
    System.out.println(metadata.get(Metadata.CONTENT_ENCODING));

    // The bytes really are UTF-8, although the meta tag claims iso-8859-1.
    InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler h = new BodyContentHandler(10000);
    parser.parse(in, h, metadata, new ParseContext());

    System.out.print(h.toString());
    System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
}
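
The effect of trusting the wrong meta tag can be seen without Tika at all. The following is a minimal sketch using only the Java standard library: it decodes the UTF-8 bytes of the sample text once with the caller's hint and once with the charset from the meta tag. The class name `CharsetHintDemo` and the precedence rule in `chooseCharset` (caller hint first, then meta tag, then UTF-8) are assumptions made for illustration; they are not Tika's API or its actual detection logic.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetHintDemo {

    /** Returns the named charset, or null if the name is missing, illegal, or unsupported. */
    public static Charset lookup(String name) {
        if (name == null) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            return null;
        }
    }

    /**
     * Hypothetical precedence rule (an assumption, not Tika's behavior):
     * prefer the caller-supplied hint, e.g. from an HTTP response header,
     * then the charset named in the meta tag, then fall back to UTF-8.
     */
    public static Charset chooseCharset(String callerHint, String metaTagCharset) {
        Charset fromHint = lookup(callerHint);
        if (fromHint != null) {
            return fromHint;
        }
        Charset fromMeta = lookup(metaTagCharset);
        if (fromMeta != null) {
            return fromMeta;
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes the bytes with the given charset, as a parser would after choosing an encoding. */
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Über den Wolken", written with an escape so the source file encoding does not matter.
        byte[] utf8Bytes = "\u00DCber den Wolken".getBytes(StandardCharsets.UTF_8);

        // The caller hint wins, so the text round-trips intact.
        String withHint = decode(utf8Bytes, chooseCharset("UTF-8", "iso-8859-1"));
        // Only the (wrong) meta tag is consulted, so the two UTF-8 bytes of the
        // first letter are decoded as two separate ISO-8859-1 characters.
        String withMetaOnly = decode(utf8Bytes, chooseCharset(null, "iso-8859-1"));

        System.out.println("hint honored:  " + withHint.length() + " chars");
        System.out.println("meta tag only: " + withMetaOnly.length() + " chars");
    }
}
```

With the hint honored, the decoded string has the same 15 characters as the original; with only the meta tag, it has 16, because the two-byte UTF-8 sequence for the first letter is read as two single-byte characters.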

Attachments

TIKA-539.patch, 26/Oct/10 18:54, 3 kB, Reinhard Pötz
TIKA-539_2.patch, 26/Oct/10 22:26, 3 kB, Reinhard Pötz

Issue Links

is related to

TIKA-431 Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly. (Resolved)

TIKA-2038 A more accurate facility for detecting Charset Encoding of HTML documents (Open)

relates to

TIKA-868 TXT parser does not honour the specified encoding (Closed)

People

Assignee: Kenneth William Krugler

Reporter: Reinhard Pötz

Votes: 3

Watchers: 10

Dates

Created: 26/Oct/10 18:14

Updated: 17/Aug/21 13:29