Tika / TIKA-465

LanguageIdentifier API enhancements

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: languageidentifier
    • Labels: None

    Description

      Jerome Charron originally reported this in NUTCH-86, where he identified a set of improvements to the LanguageIdentifier that we should consider in Tika:

      More information can be found in the following thread on the Nutch-Dev mailing list:
      http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html

      Summary:

      1. LanguageIdentifier API changes. The similarity methods should return an ordered array of language-code/score pairs instead of a simple String containing the language-code.

      2. Ensure consistency between LanguageIdentifier scoring and NGramProfile.getSimilarity().
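
To make the first point concrete, here is a minimal sketch of what an ordered array of language-code/score pairs could look like. Every name in it (`LanguageRankingSketch`, `LanguageScore`, `rank`) is hypothetical and is not taken from the Tika or Nutch code base, and the scores in `main` are made-up values. It also assumes that a higher score means a closer match; if the underlying measure is a distance, the ordering would need to be flipped.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from Tika or Nutch.
final class LanguageRankingSketch {

    /** One language-code/score pair. */
    static final class LanguageScore {
        final String languageCode;
        final double score;

        LanguageScore(String languageCode, double score) {
            this.languageCode = languageCode;
            this.score = score;
        }

        @Override
        public String toString() {
            return languageCode + "=" + score;
        }
    }

    /**
     * Turns per-language scores into an array ordered best match first.
     * Assumption: a higher score means a closer match. Ties are broken
     * by language code so the ordering is deterministic.
     */
    static LanguageScore[] rank(Map<String, Double> scoresByLanguage) {
        return scoresByLanguage.entrySet().stream()
                .map(e -> new LanguageScore(e.getKey(), e.getValue()))
                .sorted(Comparator
                        .comparingDouble((LanguageScore s) -> s.score)
                        .reversed()
                        .thenComparing(s -> s.languageCode))
                .toArray(LanguageScore[]::new);
    }

    public static void main(String[] args) {
        // Made-up scores, for illustration only.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("fr", 0.41);
        scores.put("en", 0.82);
        scores.put("de", 0.17);

        for (LanguageScore s : rank(scores)) {
            System.out.println(s);
        }
    }
}
```

A caller that only wants the single best guess can still take element 0, while callers that care about confidence can compare the gap between the first and second entries. If the scores come from the same computation as the profile similarity, the second point is addressed as well.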

      I just wanted to capture the issue here in Tika, because I'm about to close it out in Nutch: language identification is something that can happen in Tika-ville...

            People

              Assignee: Kenneth William Krugler (kkrugler)
              Reporter: Chris A. Mattmann (chrismattmann)
              Votes: 0
              Watchers: 2
