Tika / TIKA-369

Improve accuracy of language detection

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.6
    • Fix Version/s: None
    • Component/s: languageidentifier
    • Labels:
      None

      Description

      Currently the LanguageProfile code compares 3-gram counts using Pearson's chi-square test to find the best-matching language profile. This approach has three issues:

      1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better on short text, which should also make language detection faster because less text needs to be processed.
      2. The current LanguageIdentifier.isReasonablyCertain() method uses a fixed value as the certainty threshold. This is very sensitive to the amount of text being processed, and thus gives false negatives for short runs of text.
      3. Certainty should also be based on how much better the result for language X is than the result for the next best language. If two languages have identical sum-of-squares values, and that value is below the threshold, then the result is still not very certain.
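
      To make points 1 and 3 concrete, here is a rough sketch of what an LLR-based comparison with a relative certainty check could look like. It is not Tika code: the class LlrLanguageSketch, its method names, and the minMargin parameter are all hypothetical, and the statistic shown is the generic G^2 form (2 * sum of observed * ln(observed / expected)) over a 2 x K table of 3-gram counts, which may differ from what a final patch would use.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch only; none of these names exist in Tika. */
final class LlrLanguageSketch {

    /** Counts overlapping character 3-grams in the text. */
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /**
     * G^2 statistic for the 2 x K table whose rows are the two count maps.
     * Zero means identical relative frequencies; larger means less similar.
     */
    static double llrDistance(Map<String, Integer> doc, Map<String, Integer> profile) {
        long docTotal = total(doc);
        long profileTotal = total(profile);
        if (docTotal == 0 || profileTotal == 0) {
            return Double.POSITIVE_INFINITY; // nothing to compare
        }
        double grand = (double) docTotal + profileTotal;
        Set<String> grams = new HashSet<>(doc.keySet());
        grams.addAll(profile.keySet());
        double sum = 0.0;
        for (String g : grams) {
            int a = doc.getOrDefault(g, 0);
            int b = profile.getOrDefault(g, 0);
            double colTotal = a + b;
            // expected count = row total * column total / grand total
            sum += term(a, docTotal * colTotal / grand);
            sum += term(b, profileTotal * colTotal / grand);
        }
        return 2.0 * sum;
    }

    /** One cell's contribution; empty cells contribute nothing. */
    private static double term(int observed, double expected) {
        return observed == 0 ? 0.0 : observed * Math.log(observed / expected);
    }

    private static long total(Map<String, Integer> counts) {
        long t = 0;
        for (int c : counts.values()) {
            t += c;
        }
        return t;
    }

    /**
     * Relative-margin check: the best distance must beat the runner-up by at
     * least the fraction minMargin (an untuned, assumed parameter).
     */
    static boolean isClearlyBest(double best, double secondBest, double minMargin) {
        if (Double.isInfinite(best) || secondBest <= 0.0) {
            return false; // no usable score, or a tie at zero
        }
        if (Double.isInfinite(secondBest)) {
            return true; // only one candidate produced a finite score
        }
        return (secondBest - best) / secondBest >= minMargin;
    }
}
```

      A caller would compute llrDistance for each profile, take the two smallest values, and pass them to isClearlyBest. Since both distances grow with the amount of text, comparing them as a ratio should be less length-sensitive than a fixed cutoff, though that would need to be verified on real profiles.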

      1. textcat.pdf
        76 kB
        Ken Krugler
      2. Surprise and Coincidence.pdf
        1.41 MB
        Ken Krugler
      3. lingdet-mccs.pdf
        215 kB
        Ken Krugler

        Issue Links

          Activity

          Ken Krugler created issue -
          Ken Krugler made changes -
          Field Original Value New Value
          Attachment dunning94-trimmed.pdf [ 12431250 ]
          Ken Krugler made changes -
          Attachment lingdet-mccs.pdf [ 12431320 ]
          Ken Krugler made changes -
          Attachment dunning94-trimmed.pdf [ 12431250 ]
          Ken Krugler made changes -
          Description Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues:

          1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better, which would then make language detection faster due to less text needing to be processed. It might be sufficient to re-enable support for 1..4-grams (similar to original Nutch code) to improve quality.
          2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as a threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negative results for short runs of text.
          3. Certainty should also be based on how much better the result is for language X, compared to the next best language. If two languages both had identical sum-of-squares values, and this value was below the threshold, then the result is still not very certain.

          Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues:

          1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better, which would then make language detection faster due to less text needing to be processed.
          2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as a threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negative results for short runs of text.
          3. Certainty should also be based on how much better the result is for language X, compared to the next best language. If two languages both had identical sum-of-squares values, and this value was below the threshold, then the result is still not very certain.

          Ken Krugler made changes -
          Attachment Surprise and Coincidence.pdf [ 12431338 ]
          Ken Krugler made changes -
          Link This issue relates to TIKA-209 [ TIKA-209 ]
          Ken Krugler made changes -
          Link This issue is related to NUTCH-666 [ NUTCH-666 ]
          Ken Krugler made changes -
          Link This issue relates to TIKA-465 [ TIKA-465 ]
          Ken Krugler made changes -
          Link This issue relates to TIKA-496 [ TIKA-496 ]
          Ken Krugler made changes -
          Attachment textcat.pdf [ 12460114 ]
          Ken Krugler made changes -
          Link This issue relates to TIKA-322 [ TIKA-322 ]
          Ken Krugler made changes -
          Link This issue is duplicated by TIKA-1091 [ TIKA-1091 ]

            People

            • Assignee:
              Ken Krugler
              Reporter:
              Ken Krugler
            • Votes:
              5
              Watchers:
              9

              Dates

              • Created:
                Updated:

                Development