TIKA-369: Improve accuracy of language detection

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.6
    • Fix Version/s: None
    • Component/s: languageidentifier
    • Labels: None

      Description

      Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues:

      1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better, which would also make language detection faster because less text needs to be processed.
      2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as the threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negatives for short runs of text.
      3. Certainty should also be based on how much better the result for language X is compared to the next-best language. If two languages had identical sum-of-squares values, and that value was below the threshold, the result would still not be very certain. (A rough sketch of LLR scoring with a margin-based certainty check follows this list.)
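      As an illustration of points 1 and 3, here is a rough sketch of LLR-style scoring where certainty comes from the margin between the best and second-best language rather than from an absolute threshold. This is not Tika's code; the class, field, and parameter names are hypothetical.

{code:java}
import java.util.Map;

// Hypothetical sketch: score each candidate language by the log-likelihood of the
// observed 3-grams under that language's profile, then require a margin over the
// runner-up before claiming a confident result.
public class LlrLanguageScorer {

    /** Illustrative container for one language's 3-gram log-probabilities. */
    public static class Profile {
        final String language;
        final Map<String, Double> logProb; // 3-gram -> log P(gram | language)
        final double unseenLogProb;        // smoothed value for unseen 3-grams

        Profile(String language, Map<String, Double> logProb, double unseenLogProb) {
            this.language = language;
            this.logProb = logProb;
            this.unseenLogProb = unseenLogProb;
        }

        double score(String text) {
            double sum = 0.0;
            for (int i = 0; i + 3 <= text.length(); i++) {
                sum += logProb.getOrDefault(text.substring(i, i + 3), unseenLogProb);
            }
            return sum;
        }
    }

    /** Returns the best language, or null if the margin over the runner-up is too small. */
    public static String detect(String text, Iterable<Profile> profiles, double minMarginPerGram) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        double secondScore = Double.NEGATIVE_INFINITY;
        for (Profile p : profiles) {
            double s = p.score(text);
            if (s > bestScore) {
                secondScore = bestScore;
                bestScore = s;
                best = p.language;
            } else if (s > secondScore) {
                secondScore = s;
            }
        }
        // Certainty is based on how much better the winner is than the next-best
        // language, normalized by the number of 3-grams so that short inputs are
        // not penalized by an absolute threshold.
        int grams = Math.max(1, text.length() - 2);
        double marginPerGram = (bestScore - secondScore) / grams;
        return marginPerGram >= minMarginPerGram ? best : null;
    }
}
{code}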

      Attachments

      1. lingdet-mccs.pdf (215 kB, Ken Krugler)
      2. Surprise and Coincidence.pdf (1.41 MB, Ken Krugler)
      3. textcat.pdf (76 kB, Ken Krugler)

          Activity

           Ken Krugler added a comment - edited

          Karl Wettin had contributed a language detector to Lucene, though it was never rolled in. See LUCENE-826. This might be an interesting alternative.

           Jean-François Halleux also contributed a "language guesser" to Lucene a while back. See LUCENE-180. This was marked as a duplicate of LUCENE-826.

          Ken Krugler added a comment -

          Smaller version of Ted Dunning's 1994 paper.

          Ken Krugler added a comment -

          Attaching another paper from Ted that makes it clearer why the chi-squared method currently used has problems for small text chunks.

          Ken Krugler added a comment -

          See http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html for a nice review (slightly dated) of available language detection packages. The cpdetector package did a good job of parsing HTML metadata tags & XML header info - it uses an ANTLR grammar.

          Ken Krugler added a comment -

          Including original paper for reference.

          Jan Høydahl added a comment -

           Any new thoughts on this one? It seems like LUCENE-826 might be better and more complete than the current LangId in Tika.
           Also, there is an idea of using dictionary-based matching for small texts. Perhaps based on lucene-hunspell and OOo dictionaries? What do you think of such a hybrid solution?

          Ken Krugler added a comment -

           From reading through LUCENE-826, it appears that there are a lot of dependencies, which Karl alluded to in one of his comments.

           Other than that, it seems like it would be much better than the current code.

          Ted Dunning added a comment -

           Dictionary-based matching is not going to work as well as what Karl was proposing. The method that Ken was proposing for TIKA (that I developed many eons ago) is lighter weight than what Karl is suggesting, but should be almost as good.

           In my experiments with 8 languages, I was able to classify 50-byte strings with very high accuracy with only a few KB of training data. Dictionary techniques do not do as well. With the availability of training data from Wikipedia, there is really little excuse for anything but a learning approach.

          Whether Karl's work is ready to use or not is an open question.

          Ted Dunning added a comment -

           I think that Ken discovered that the major optimization was to merely classify the beginning of a document.

           Georger Araújo added a comment - edited

          I've had great results with the language-detection library [1,2].
          Pros: great accuracy, fast, Apache licensed.
          Cons: unsurprisingly, has trouble with short text.

          [1] http://code.google.com/p/language-detection/
          [2] http://www.slideshare.net/shuyo/language-detection-library-for-java
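           For reference, a minimal usage sketch of the library's Detector API. The "profiles" directory path and the sample text are assumptions for this sketch; the library ships its profiles as JSON files.

{code:java}
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;

public class LangDetectExample {
    public static void main(String[] args) throws LangDetectException {
        // Load the JSON language profiles that ship with the library.
        // The "profiles" directory path is an assumption; point it at your copy.
        DetectorFactory.loadProfile("profiles");

        Detector detector = DetectorFactory.create();
        detector.append("Ceci est un petit texte en français.");

        System.out.println(detector.detect());           // most likely language code, e.g. "fr"
        System.out.println(detector.getProbabilities()); // ranked languages with probabilities
    }
}
{code}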

          Joseph Vychtrle added a comment -

           IMHO the CERTAINTY_LIMIT is too rigorous. I was testing documents with 5000+ words, and the detection was uncertain in 10 out of 10 cases. However, it was correct in 10 out of 10 cases.
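           For context, a minimal sketch of how the detector under discussion is typically invoked (assuming the org.apache.tika.language.LanguageIdentifier API referenced above); the sample text is illustrative.

{code:java}
import org.apache.tika.language.LanguageIdentifier;

public class TikaLangIdExample {
    public static void main(String[] args) {
        String text = "Dies ist ein kurzer deutscher Beispieltext.";
        LanguageIdentifier identifier = new LanguageIdentifier(text);

        // getLanguage() returns the best-matching profile, while
        // isReasonablyCertain() compares the distance against a fixed
        // CERTAINTY_LIMIT, which is why a result can be flagged as
        // uncertain even when the top language is correct.
        System.out.println(identifier.getLanguage());
        System.out.println(identifier.isReasonablyCertain());
    }
}
{code}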

          Joseph Vychtrle added a comment -

           Wouldn't it be better if the field weren't private, so developers could set it according to the situation?

          Robert Muir added a comment -

           "Cons: unsurprisingly, has trouble with short text."

           Not any less trouble than competing libraries:
           http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

           It's interesting if you read their paper; I think the normalizations etc. they made make total sense, and I can easily see how that would make a big difference on train vs. test when the training data is stuff like Wikipedia (which isn't always totally realistic).

           I haven't played with their approach for CJK detection, but it makes sense to me; it would be great to see some evaluation results for that case.

           On the other hand, I think CLD has nice stuff like segmenting per-script (not ambiguous) first to eliminate stupidity when a document has multiple scripts (e.g. Cyrillic+Latin or Arabic+Latin). It would be great if the Cybozu impl integrated this approach as well.
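           To illustrate the per-script segmentation idea (this is an editor's sketch using Java's Character.UnicodeScript, not CLD's or the Cybozu library's actual code):

{code:java}
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

public class ScriptSegmenter {

    /**
     * Splits text into runs of a single script. COMMON/INHERITED code points
     * (spaces, digits, punctuation) do not start a new run; they attach to the
     * current one.
     */
    public static List<String> segmentByScript(String text) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        UnicodeScript currentScript = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            boolean neutral = script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
            if (!neutral && currentScript != null && script != currentScript) {
                runs.add(current.toString());
                current.setLength(0);
            }
            if (!neutral) {
                currentScript = script;
            }
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // A mixed Latin + Cyrillic input is split into unambiguous per-script runs,
        // each of which could then be fed to the language detector separately.
        System.out.println(segmentByScript("Hello world, привет мир!"));
    }
}
{code}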

           Christian Moen added a comment - edited

          Does anyone have any thoughts on how we should follow up on this?

           The language-detection library looks attractive to me and seems to be the best Java-based language detection library available, and it also has a suitable license. However, it seems to require Java 6, while Tika is still based on Java 5. Does this effectively rule out using language-detection for Tika?

           Does it make sense to make language-detection an option that can be used as an alternative to the current detector?

           The idea is basically to support language-detection in addition to what we have today, with the latter being the default.

          Robert Muir added a comment -

           "However, it seems to require Java 6 and Tika is still based on Java 5."

           It doesn't actually require Java 6; it was just compiled that way. If you recompile it with Java 5, it works fine without any source code changes. That's how we support it in Solr 3.x (which only requires Java 5).

          Christian Moen added a comment -

          Thanks for clarifying this, Robert. This is good news vis-à-vis any use in Tika.

          Pander Musubi added a comment -

           +1 for using https://code.google.com/p/language-detection/, which supports more languages with better models and now also supports detection of language in short texts (-sm = short messages).

          Michael McCandless added a comment -

           +1 to cut over to https://code.google.com/p/language-detection
          Pander Musubi added a comment -

           language-detection uses variable-length n-grams.

          Michael McCandless added a comment -

          The language-detection lib is now in Maven: http://search.maven.org/#artifactdetails|com.cybozu.labs|langdetect|1.1-20120112|jar

          And it's compiled to Java 5 ...

           I think we should do a hard cutover (replace Tika's current language detection with this library). Any objections?

          Ted Dunning added a comment -

          It is hard to object, but it would be good to replicate the accuracy numbers on the kind of text that Tika typically sees.

          Michael McCandless added a comment -

           Back when I tested this, the best test corpus I could find was Europarl (21 languages), and language-detection did very well (99.22% vs Tika's 97.12%) and was also much faster (1.18 MB/sec vs Tika's 0.066 MB/sec).

           I agree that if there's a corpus better than Europarl (more like the kind of text Tika typically sees), we should test it...

          Ken Krugler added a comment -

          I've been using language-detection in another project for six months. In general it works better than what's in Tika, but has a number of design/coding issues (gnarly singleton DetectorFactory, assumption that profiles are loaded from external files, problems with setting a priori language probabilities). I've got a fork of it with some fixes, but it's not ready for prime time.

          So net-net is a mild +1 from me, but I think there may be some post-integration challenges.

          Robert Muir added a comment -

           The DetectorFactory is definitely gnarly, but you can load the JSON of the profiles yourself from resource files (e.g. in the JAR) and use loadProfile(List<String> json_profiles).

           This is how Solr worked around the issue of wanting to bundle profiles easily in the JAR.
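           Following that suggestion, here is a sketch of loading bundled profiles from the classpath and handing them to loadProfile(List<String>). The resource path and the way languages are enumerated are assumptions, not the library's or Solr's actual code.

{code:java}
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BundledProfileLoader {

    /**
     * Reads JSON profiles bundled as classpath resources and passes them to the
     * factory, avoiding any dependency on an external profile directory.
     * The "/langdetect-profiles/" resource path is an assumption for this sketch.
     */
    public static void loadBundledProfiles(List<String> languages)
            throws IOException, LangDetectException {
        List<String> jsonProfiles = new ArrayList<>();
        for (String lang : languages) {
            try (InputStream in = BundledProfileLoader.class
                    .getResourceAsStream("/langdetect-profiles/" + lang)) {
                if (in == null) {
                    throw new IOException("Missing profile resource for: " + lang);
                }
                BufferedReader reader =
                        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
                StringBuilder json = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    json.append(line);
                }
                jsonProfiles.add(json.toString());
            }
        }
        DetectorFactory.loadProfile(jsonProfiles);
    }
}
{code}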

          Chris A. Mattmann added a comment -

          +1 from me, I'm fine with it. Incremental improvement is always nice and if it doesn't work out, we can always roll back.

          Pander Musubi added a comment -

           I know someone from another community who has created a Java servlet around https://code.google.com/p/language-detection, and it will be submitted back to that project. At the moment he is making some improvements to the already functioning version, but he could use some extra hands. If anybody is interested in his current version in a Git repo, please contact me and I will introduce you to each other.

          Ken Krugler added a comment -

          Some questions then about integrating language-detection:

          1. Do we care about thread safety?

          If yes, then I think we'd either need our own version of the library, or get some fixes rolled into the upstream project.

          2. How much control over settings?

          E.g. specifying the set of supported languages, assigning a priori language probabilities, specifying max text length, etc?

          If neither is an issue, then I could roll this in pretty quickly.


            People

            • Assignee: Ken Krugler
            • Reporter: Ken Krugler
            • Votes: 5
            • Watchers: 9
