[TIKA-2219] CharsetDetector no longer detects windows-1252 charset - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.14
Fix Version/s: 1.15, 2.0.0
Component/s: parser
Labels: None
Environment: Any.

Description

Starting with Tika 1.14, CharsetDetector no longer reports windows-1252; it always reports ISO-8859-1 instead. I have not tested them, but other windows-125* encodings are likely affected as well.

I tracked it down to a change in the CharsetRecog_sbcs.CharsetRecog_8859_1#getName() method. It now always returns "ISO-8859-1", whereas before it was: return haveC1Bytes ? "windows-1252" : "ISO-8859-1";
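For context, the C1 range is the bytes 0x80 to 0x9F. ISO-8859-1 places control codes there, while windows-1252 assigns printable characters such as curly quotes to most of that range, so seeing those bytes is a hint that the text is windows-1252. The rule the old getName() applied can be sketched as follows. This is a minimal illustration only: the class C1Check and its methods are hypothetical names, not part of Tika.

```java
// Hypothetical helper, not Tika code: picks a charset name the way the
// old getName() did, based on whether any C1 byte is present.
class C1Check {
    // True if any byte of the input falls in 0x80..0x9F (the C1 range).
    static boolean hasC1Bytes(byte[] input) {
        for (byte b : input) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x80 && v <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    // Same ternary as quoted in the description above.
    static String nameFor(byte[] input) {
        return hasC1Bytes(input) ? "windows-1252" : "ISO-8859-1";
    }
}
```

For example, input containing the bytes 0x93 and 0x94 (curly double quotes in windows-1252) would be named "windows-1252", while input whose only high byte is 0xE9 would be named "ISO-8859-1".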

That condition has been moved into the match(CharsetDetector det) method so that the returned CharsetMatch carries the proper name. The problem is that the CharsetDetector#detectAll() method discards that match and replaces it with a new one, which takes its name from the CharsetRecognizer's #getName() instead (always "ISO-8859-1" in this case).

There might be legitimate reasons why the CharsetMatch instances in the detectAll() method are replaced with new ones, but making the following change in that method appears to work for me:

// Remove this:
// CharsetMatch m = new CharsetMatch(this, csr, confidence);
// matches.add(m);

// Add this instead:
matches.add(charsetMatch);
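The effect of that change can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not Tika's implementation: Match, Latin1Recognizer, detectAllRebuilding, and detectAllReusing are hypothetical stand-ins for CharsetMatch, the ISO-8859-1 recognizer, and the two variants of detectAll(), and the confidence value 50 is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the bug: all names here are hypothetical, not Tika classes.
class DetectAllSketch {
    // Stand-in for CharsetMatch: only a name and a confidence.
    static final class Match {
        final String name;
        final int confidence;

        Match(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    // Stand-in for the recognizer: getName() is fixed, match() refines the name.
    static final class Latin1Recognizer {
        String getName() {
            return "ISO-8859-1";
        }

        Match match(byte[] input) {
            boolean haveC1Bytes = false;
            for (byte b : input) {
                int v = b & 0xFF;
                if (v >= 0x80 && v <= 0x9F) {
                    haveC1Bytes = true;
                    break;
                }
            }
            // 50 is an arbitrary confidence chosen for this sketch.
            return new Match(haveC1Bytes ? "windows-1252" : getName(), 50);
        }
    }

    // Models the reported behavior: the match is rebuilt from getName(),
    // so the refined name computed in match() is lost.
    static List<Match> detectAllRebuilding(byte[] input) {
        Latin1Recognizer csr = new Latin1Recognizer();
        Match charsetMatch = csr.match(input);
        List<Match> matches = new ArrayList<>();
        matches.add(new Match(csr.getName(), charsetMatch.confidence));
        return matches;
    }

    // Models the proposed change: the recognizer's own match is kept.
    static List<Match> detectAllReusing(byte[] input) {
        Latin1Recognizer csr = new Latin1Recognizer();
        Match charsetMatch = csr.match(input);
        List<Match> matches = new ArrayList<>();
        matches.add(charsetMatch);
        return matches;
    }
}
```

In this model, input containing C1 bytes comes back as "ISO-8859-1" from the rebuilding variant and as "windows-1252" from the reusing variant, which mirrors the behavior described in this report.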

Attachments

test.txt (0.3 kB), attached 31/Aug/17 21:55 by Matthew Caruana Galizia

People

Assignee: Unassigned
Reporter: Pascal Essiembre
Votes: 0
Watchers: 4

Dates

Created: 19/Dec/16 22:12
Updated: 12/Apr/21 13:01
Resolved: 20/Dec/16 19:26