TIKA-496

Language identifier profile comparison favors large profiles

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Do
    • Affects Version/s: 0.7
    • Fix Version/s: None
    • Component/s: languageidentifier
    • Labels: None

    Description

      I think I've found a flaw in the distance algorithm.

      In the distance() method of LanguageProfile.java, we normalize the frequency of an ngram by dividing its count by the profile's total count.
      The total count for a profile is simply the sum of all counts in the profile.

      The problem is that the .ngp files are cut off at 1000 entries, so the total count is the sum of only those 1000 entries.
      However, there is a long tail of lower-frequency ngrams that is cut off and therefore not included in the total count.
      The effect is that ngrams from profiles built on a large training set carry more weight than ngrams from profiles built on a smaller training set.
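
      A minimal sketch of the normalization problem described above. It is not Tika's LanguageProfile code: the class name, the method names, and all counts are made up for illustration. It only shows that dividing by the sum of the retained entries gives a different relative frequency than dividing by the real total, and that the gap depends on how much count mass the cutoff dropped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class; not part of Tika.
final class TruncatedTotalDemo {

    // Relative frequency of one ngram: its count divided by a total.
    static double frequency(long count, long total) {
        return (double) count / total;
    }

    // Sum of the counts that survived the cutoff, i.e. what a loader
    // sees when it adds up the entries of a truncated profile.
    static long retainedTotal(Map<String, Long> retained) {
        long sum = 0;
        for (long c : retained.values()) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up profile: two ngrams survive the cutoff.
        Map<String, Long> retained = new LinkedHashMap<>();
        retained.put("_de", 600L);
        retained.put("er_", 400L);

        // Made-up count mass of all ngrams dropped by the cutoff.
        long droppedTail = 1000L;

        long truncatedTotal = retainedTotal(retained);   // 1000
        long realTotal = truncatedTotal + droppedTail;   // 2000

        // The same ngram looks twice as frequent under the truncated total.
        System.out.println(frequency(600L, truncatedTotal)); // 0.6
        System.out.println(frequency(600L, realTotal));      // 0.3
    }
}
```

      Two profiles that lose different shares of their counts to the cutoff are therefore scaled by different factors before they are compared.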

      You can see this effect especially well when classifying short texts in a language which has similar sister languages with larger training sets. My example is "no" vs "da".

      Sample from the tail of "no.ngp":
      _gå 461
      ask 461
      ria 459
      små 459

      ...and from the tail of "dk.ngp":
      dbr 966
      ost 966
      ævn 964

      It is obvious that "dk" has a longer tail after cutoff than "no" and therefore a larger sum.

      One solution is to compute the real total count when generating the .ngp file and store that total in the profile file itself, instead of summing the counts when loading the cut-off profile.
      Alternatively, normalize the counts before writing the .ngp file, so that the top entry is always 100000.
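
      The two options above could be sketched roughly as follows. This is an illustration under stated assumptions, not a patch against Tika: the class NgpTotalsSketch, both method names, and the "# total N" header line are hypothetical, and the real .ngp writer may be structured differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of Tika.
final class NgpTotalsSketch {

    // Target value for the largest count, as proposed in the issue.
    static final long TOP = 100_000L;

    // Option 1: compute the total over ALL ngrams before the cutoff and
    // write it as a header line, followed by the top `cutoff` entries.
    // The "# total N" header format is an assumption for illustration.
    static String writeProfile(Map<String, Long> allCounts, int cutoff) {
        long total = 0;
        for (long c : allCounts.values()) {
            total += c;
        }
        StringBuilder sb = new StringBuilder("# total " + total + "\n");
        allCounts.entrySet().stream()
                .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
                .limit(cutoff)
                .forEach(e -> sb.append(e.getKey()).append(' ')
                        .append(e.getValue()).append('\n'));
        return sb.toString();
    }

    // Option 2: rescale the counts so that the largest one becomes TOP.
    static Map<String, Long> normalizeToTop(Map<String, Long> counts) {
        long max = 0;
        for (long c : counts.values()) {
            max = Math.max(max, c);
        }
        Map<String, Long> out = new LinkedHashMap<>();
        if (max == 0) {
            return out; // nothing to scale
        }
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            double scaled = e.getValue() * (double) TOP / max;
            out.put(e.getKey(), Math.round(scaled));
        }
        return out;
    }
}
```

      With option 1 the loader would read the stored total instead of summing the retained entries. With option 2 every profile shares the same scale for its top entry, whatever the size of its training set.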

            People

              Assignee: Unassigned
              Reporter: Jan Høydahl (janhoy)