Tika / TIKA-322

Improve encoding detection speed and accuracy

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: mime
    • Labels:
      None

      Description

      The encoding detection code we took from ICU4J is not very efficient and sometimes produces odd results when more than one encoding matches the given input data. It would be good to refactor the code to be faster for easy-to-detect encodings and to have better heuristics in case multiple matches are found.

          Activity

          Jukka Zitting made changes -
          Status: Open [ 1 ] -> Resolved [ 5 ]
          Assignee: Jukka Zitting [ jukkaz ]
          Fix Version/s: 1.2 [ 12320169 ]
          Resolution: Fixed [ 1 ]
          Jukka Zitting added a comment -

          I integrated the juniversalchardet library into TXTParser in revision 1358624. The encoding detection mechanism still falls back to the ICU4J code if juniversalchardet wasn't able to determine the character encoding, so the risk of regressions should be pretty low.
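The fallback arrangement described in this comment can be sketched as a chain of detectors that are tried in order until one gives an answer. This is a minimal illustration, not the TXTParser code: `FallbackDetection`, `Detector`, `BOM`, `STRICT_UTF8` and `DEFAULT_CHAIN` are hypothetical names, and the two stages are stand-ins built only on the JDK (a byte order mark check as the cheap stage, a strict UTF-8 validity check as the slower one), whereas the real stages would be juniversalchardet followed by the ICU4J code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a detector chain; none of these names come from Tika.
final class FallbackDetection {

    /** One detection strategy; returns empty when it cannot decide. */
    interface Detector {
        Optional<Charset> detect(byte[] data);
    }

    /** Cheap first stage: recognise a UTF-8 or UTF-16 byte order mark. */
    static final Detector BOM = data -> {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            // UTF-32LE also starts with FF FE; this sketch ignores UTF-32.
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    };

    /** Stand-in for a slower stage: accept input that is strictly valid UTF-8. */
    static final Detector STRICT_UTF8 = data -> {
        try {
            // A fresh decoder reports malformed input by throwing.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(data));
            return Optional.of(StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    };

    /** Stages in the order they are tried. */
    static final List<Detector> DEFAULT_CHAIN = Arrays.asList(BOM, STRICT_UTF8);

    /** Returns the first answer in the chain, or empty if every stage gives up. */
    static Optional<Charset> detect(List<Detector> chain, byte[] data) {
        for (Detector detector : chain) {
            Optional<Charset> result = detector.detect(data);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A later stage only runs when every earlier stage returns empty, which is why adding a new first stage in front of an existing detector carries a low regression risk for inputs the new stage cannot classify.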

          Ken Krugler added a comment -

          Some of the same issues with the n-gram statistical model for language detection also impact the quality of ICU's charset detection code/data, which is being used by Tika.

          Ken Krugler made changes -
          Link: This issue is related to TIKA-369 [ TIKA-369 ]
          Felix Meschberger added a comment -

          According to [1], the MPL is a Category B license, and work under such a license can be included in binary-only form.

          [1] http://www.apache.org/legal/resolved.html#category-b

          Ken Krugler made changes -
          Link: This issue relates to TIKA-333 [ TIKA-333 ]
          Luke Nezda added a comment -

          http://code.google.com/p/juniversalchardet/ has a pretty good, efficient charset detector, which is a Java port of the Mozilla universalchardet algorithms. It is licensed under the Mozilla Public License Version 1.1. I am not sure whether the MPL is ASF-compatible; it appears to be, but IANAL. AFAIK, it does not provide the detection confidence or language detection features that ICU4J does, and I think it has code/data files for fewer encodings, but it is primarily statistical, so they could be added. I am also not sure what choices were made with regard to multiple encodings. In theory, it should detect what Firefox detects for a given URL/file.

          Jukka Zitting created issue -

            People

            • Assignee:
              Jukka Zitting
              Reporter:
              Jukka Zitting
            • Votes:
              1
              Watchers:
              1

              Dates

              • Created:
                Updated:
                Resolved:
