Tika / TIKA-369

Improve accuracy of language detection

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.6
    • Fix Version/s: None
    • Component/s: languageidentifier
    • Labels: None

      Description

      Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues:

      1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better, which would also make language detection faster because less text would need to be processed.
      2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as a threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negative results for short runs of text.
      3. Certainty should also be based on how much better the result is for language X compared to the next best language. If two languages have identical sum-of-squares values, and that value is below the threshold, the result is still not very certain.
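
A rough sketch of points 1 and 3, to make the proposal concrete. This is not Tika's existing code or API: the class LlrLanguageSketch, its methods, and the minRelativeGap parameter are all hypothetical names chosen for illustration. It assumes each profile is just a map of 3-gram counts, scores a text against a profile with the G² log-likelihood ratio statistic for a 2 x c contingency table (lower means a closer match), and treats a result as certain only when the best score beats the runner-up by a relative margin rather than comparing it to a fixed threshold.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tika.
final class LlrLanguageSketch {

    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Counts overlapping character 3-grams in the given text.
    static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // G^2 statistic for the 2 x c table whose rows are (text, profile)
    // and whose columns are the distinct 3-grams seen in either one.
    // 0 means identical relative frequencies; larger means a worse match.
    static double llr(Map<String, Integer> text, Map<String, Integer> profile) {
        Set<String> grams = new HashSet<>(text.keySet());
        grams.addAll(profile.keySet());

        long textTotal = 0;
        long profileTotal = 0;
        double cells = 0.0;    // sum of k*ln(k) over all table cells
        double columns = 0.0;  // sum of C*ln(C) over column totals
        for (String gram : grams) {
            long a = text.getOrDefault(gram, 0);
            long b = profile.getOrDefault(gram, 0);
            textTotal += a;
            profileTotal += b;
            cells += xLogX(a) + xLogX(b);
            columns += xLogX(a + b);
        }
        double rows = xLogX(textTotal) + xLogX(profileTotal);
        double g2 = 2.0 * (cells - rows - columns + xLogX(textTotal + profileTotal));
        return Math.max(0.0, g2); // clamp tiny negative rounding errors
    }

    // Relative-margin certainty check: the best (lowest) score must be
    // smaller than the second-best score by at least minRelativeGap,
    // expressed as a fraction of the second-best score.
    static boolean isClearWinner(double best, double secondBest, double minRelativeGap) {
        if (secondBest <= 0.0) {
            return false; // both candidates match perfectly, so it is a tie
        }
        return (secondBest - best) / secondBest >= minRelativeGap;
    }
}
```

A caller would compute llr(...) against every profile, take the two lowest scores, and pass them to isClearWinner. Whether a relative gap is the right measure, and what value of minRelativeGap works in practice, would need to be tuned against real test data.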

        Attachments

        1. textcat.pdf
          76 kB
          Ken Krugler
        2. Surprise and Coincidence.pdf
          1.41 MB
          Ken Krugler
        3. lingdet-mccs.pdf
          215 kB
          Ken Krugler

              People

              • Assignee: Ken Krugler (kkrugler)
              • Reporter: Ken Krugler (kkrugler)
              • Votes: 5
              • Watchers: 13
