Tika / TIKA-2219

CharsetDetector no longer detects windows-1252 charset

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 1.15, 2.0.0
    • Component/s: parser
    • Labels: None
    • Environment: Any.

    Description

      Starting with Tika 1.14, windows-1252 is no longer detected; ISO-8859-1 is always reported instead. I have not tested it, but this likely affects other windows-125* encodings as well.

      I tracked it down to a change in the CharsetRecog_sbcs.CharsetRecog_8859_1#getName() method. It now always returns "ISO-8859-1", whereas before it was: return haveC1Bytes ? "windows-1252" : "ISO-8859-1";
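      For readers unfamiliar with the haveC1Bytes condition, here is a minimal standalone sketch of the idea. Bytes in the range 0x80 to 0x9F are C1 control characters in ISO-8859-1 but printable characters (curly quotes, the euro sign, and so on) in windows-1252, so their presence hints at windows-1252. The class C1ByteCheck and its methods are hypothetical names for illustration only and are not Tika's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not Tika's implementation.
public class C1ByteCheck {

    // Returns true if any byte falls in the C1 range 0x80..0x9F.
    public static boolean haveC1Bytes(byte[] input) {
        for (byte b : input) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x80 && v <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the old ternary quoted in the report.
    public static String guessName(byte[] input) {
        return haveC1Bytes(input) ? "windows-1252" : "ISO-8859-1";
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly double quotes in windows-1252.
        byte[] quoted = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        System.out.println(guessName(quoted));
        System.out.println(new String(quoted, Charset.forName("windows-1252")));

        // 0xE9 (e with acute accent) is outside the C1 range.
        byte[] latin = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guessName(latin));
    }
}
```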

      That condition has been moved to the match(CharsetDetector det) method so that the returned CharsetMatch carries the proper name. The problem is that the CharsetDetector#detectAll() method replaces that correct match with a new one whose name comes from the CharsetRecognizer's #getName() instead (which is always "ISO-8859-1" in this case).

      There might be legitimate reasons why the CharsetMatch instances in the detectAll() method are replaced with new ones, but the following change to that method appears to work for me:

      // Remove this:
      // CharsetMatch m = new CharsetMatch(this, csr, confidence);
      // matches.add(m);

      // Add this instead:
      matches.add(charsetMatch);
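
      The effect of that change can be modelled with a small self-contained sketch. The classes below (DetectAllSketch, Match, Latin1Recognizer) are simplified, hypothetical stand-ins for CharsetDetector, CharsetMatch and CharsetRecog_8859_1. They are not Tika's actual code, the confidence value is arbitrary, and any confidence adjustment the real detectAll() may perform is ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the behaviour described above.
public class DetectAllSketch {

    // Stand-in for CharsetMatch: just a name and a confidence.
    public static final class Match {
        public final String name;
        public final int confidence;

        public Match(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    // Stand-in for CharsetRecog_8859_1 after the 1.14 change.
    public static final class Latin1Recognizer {
        // Always returns the same name, as the report describes.
        public String getName() {
            return "ISO-8859-1";
        }

        // The C1 check now lives here, so the returned match has the right name.
        public Match match(byte[] input) {
            boolean haveC1Bytes = false;
            for (byte b : input) {
                int v = b & 0xFF;
                if (v >= 0x80 && v <= 0x9F) {
                    haveC1Bytes = true;
                    break;
                }
            }
            String name = haveC1Bytes ? "windows-1252" : getName();
            return new Match(name, 50); // arbitrary confidence for the sketch
        }
    }

    // Reported behaviour: the match is rebuilt from the recognizer,
    // so the name chosen in match() is lost.
    public static List<Match> detectAllBuggy(Latin1Recognizer csr, byte[] input) {
        List<Match> matches = new ArrayList<>();
        Match charsetMatch = csr.match(input);
        if (charsetMatch != null) {
            matches.add(new Match(csr.getName(), charsetMatch.confidence));
        }
        return matches;
    }

    // Proposed change: keep the match returned by the recognizer.
    public static List<Match> detectAllFixed(Latin1Recognizer csr, byte[] input) {
        List<Match> matches = new ArrayList<>();
        Match charsetMatch = csr.match(input);
        if (charsetMatch != null) {
            matches.add(charsetMatch);
        }
        return matches;
    }

    public static void main(String[] args) {
        byte[] input = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        Latin1Recognizer csr = new Latin1Recognizer();
        System.out.println("buggy: " + detectAllBuggy(csr, input).get(0).name);
        System.out.println("fixed: " + detectAllFixed(csr, input).get(0).name);
    }
}
```

      In this model the two variants differ only in which object is added to the list, which is the whole of the proposed change.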

      Attachments

        1. test.txt
          0.3 kB
          Matthew Caruana Galizia

          People

            Assignee: Unassigned
            Reporter: Pascal Essiembre (pascal.essiembre)
            Votes: 0
            Watchers: 4