[TIKA-2758] Possible error charset detection - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.18
Fix Version/s: 2.0.0-BETA
Component/s: core
Labels: None

Description

I started upgrading our SAX parser's Tika dependency from 1.17 to 1.19, ran all 995 unit tests, and observed three failures: two encoding issues and one other weird thing. The tests use real HTML.

Where we previously extracted text such as 'Spokane, Wash. [— The solar' we now get 'Spokane, Wash. [â€" The solar' in one test. The other previously had 'could take ["weeks, or' but we now get 'could take [â€œweeks, or'. Our tests pass with 1.17 but fail with 1.18 and 1.19.1.
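
The garbled sequences look like UTF-8 bytes decoded with a single-byte Western charset. That is an assumption on my part: I have not confirmed which charset the detector actually picked, and windows-1252 (`cp1252`) is used here only because it reproduces the 'â€œ' pattern. A minimal Python sketch of the round trip, with hypothetical helper names `garble` and `repair`:

```python
def garble(text: str, wrong_charset: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_charset)


def repair(text: str, wrong_charset: str = "cp1252") -> str:
    """Undo garble(): re-encode with the wrong charset, decode as UTF-8."""
    return text.encode(wrong_charset).decode("utf-8")


# U+2014 (em dash) is the UTF-8 bytes E2 80 94, U+201C (left double
# quotation mark) is E2 80 9C. Read as cp1252, each byte becomes its
# own character, so one character turns into three.
dash = garble("\u2014")
quote = garble("\u201c")

print([hex(ord(c)) for c in dash])   # code points of the garbled em dash
print([hex(ord(c)) for c in quote])  # code points of the garbled quote
print(repair(garble("Spokane, Wash. \u2014 The solar")))
```

If the assumption holds, both characters start with the same two garbled characters (U+00E2, U+20AC) and differ only in the third, which matches what the two failing tests show.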

Attached are the two HTML files.

Reading our tests again, I see an old note beside the independent test complaining about the character encoding being incorrect. It seems that somewhere before 1.17 it was faulty, just as it is now with 1.18 and higher.

Attachments

detroidnews.html, 18/Oct/18 11:07, 127 kB, Markus Jelsma
independent.html, 18/Oct/18 11:07, 216 kB, Markus Jelsma
grep_charsets.csv, 26/Oct/18 15:19, 24 kB, Tim Allison

Issue Links

is caused by

TIKA-2592 HTML with charset unicode handled as utf-16 instead utf-8

Resolved

People

Assignee: Unassigned

Reporter: Markus Jelsma

Votes: 0

Watchers: 4

Dates

Created: 18/Oct/18 11:07

Updated: 21/Jul/21 22:13