[TIKA-2771] enableInputFilter() wrecks charset detection for some short html documents

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 1.19.1
Fix Version/s: None
Component/s: detector
Labels: None

Description

When I run the CharsetDetector on http://w3c.github.io/microdata-rdf/tests/0065.html with the input filter enabled, the most confident result is, very strangely, "IBM500" with a confidence of 60, even if I set the declared encoding to UTF-8.

This can be replicated with the following code:

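import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.tika.parser.txt.CharsetDetector;

// Enable the input filter and declare the expected encoding up front.
CharsetDetector detect = new CharsetDetector();
detect.enableInputFilter(true);
detect.setDeclaredEncoding("UTF-8");
// The test page, inlined as a string and encoded as UTF-8 bytes.
detect.setText(("<!DOCTYPE html>\n" +
        "<div>\n" +
        "  <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\"></div>\n" +
        "  <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
        "  <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
        "</div>").getBytes(StandardCharsets.UTF_8));
// Print every candidate match, most confident first.
Arrays.stream(detect.detectAll()).forEach(System.out::println);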

which prints:

Match of IBM500 in fr with confidence 60
Match of UTF-8 with confidence 57
Match of ISO-8859-9 in tr with confidence 50
Match of ISO-8859-1 in en with confidence 50
Match of ISO-8859-2 in cs with confidence 12
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Match of Shift_JIS in ja with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10

Note that if I do not set the declared encoding to UTF-8, the result is even worse, with UTF-8 falling from a confidence of 57 to 15.

This is screwing up 1 out of 84 of my online microdata extraction tests over in Any23 (as that particular page is being rendered into complete gibberish), so I had to implement some hacky workarounds which I'd like to remove if possible.
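
For illustration only, here is a minimal sketch of what one such workaround could look like. It is not the actual Any23 code. It assumes the detector's matches have already been copied into a hypothetical Candidate type (charset name plus confidence), and it assumes that every EBCDIC candidate has a name starting with "IBM" and can safely be ignored for web content. Both the class names and the filtering rule are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.List;

class CharsetPick {
    /** Hypothetical stand-in for a detector match: charset name and confidence (0-100). */
    static final class Candidate {
        final String charset;
        final int confidence;

        Candidate(String charset, int confidence) {
            this.charset = charset;
            this.confidence = confidence;
        }
    }

    /**
     * Returns the most confident candidate whose name does not start with "IBM"
     * (assumed here to cover the EBCDIC family), or the fallback if none remains.
     */
    static String pick(List<Candidate> candidates, String fallback) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (c.charset.startsWith("IBM")) {
                continue; // assumption: EBCDIC is not expected for HTML pages
            }
            if (best == null || c.confidence > best.confidence) {
                best = c;
            }
        }
        return best == null ? fallback : best.charset;
    }

    public static void main(String[] args) {
        // The top four matches from the output reported above.
        List<Candidate> reported = Arrays.asList(
                new Candidate("IBM500", 60),
                new Candidate("UTF-8", 57),
                new Candidate("ISO-8859-9", 50),
                new Candidate("ISO-8859-1", 50));
        System.out.println(pick(reported, "UTF-8")); // prints UTF-8
    }
}
```

A filter like this only hides the symptom: without the declared encoding, UTF-8 drops to a confidence of 15 and would no longer win over the ISO-8859 candidates.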

EDIT: This issue may be related to TIKA-2737 and this comment.

Issue Links

is related to
TIKA-2038 A more accurate facility for detecting Charset Encoding of HTML documents (Open)

relates to
TIKA-2737 regression in charset detection (Open)
TIKA-771 "Hello, World!" in UTF-8/ASCII gets detected as IBM500 (Resolved)
TIKA-868 TXT parser does not honour the specified encoding (Closed)

People

Assignee: Unassigned
Reporter: Hans Brende
Votes: 0
Watchers: 4

Dates

Created: 01/Nov/18 17:30
Updated: 29/Oct/19 15:52