Nutch / NUTCH-2449

Usage of Tika LanguageIdentifier in language-identifier plugin

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.13
    • Fix Version/s: 1.19
    • Component/s: plugin
    • Labels: None

    Description

      The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier to identify the language of the document text. There are two issues with that:

      1. LanguageIdentifier is deprecated in Tika.
      2. It does not support CJK languages (and, I suspect, many other languages; see https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes), and it does not even fail gracefully on them: in my experience, Chinese text was identified as Italian.
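
      To illustrate what "failing gracefully" could look like for the second point, here is a minimal sketch that uses only the JDK. It is not the plugin's code and not a Tika API: the class ScriptGuard, its methods, and the "unknown" return value are hypothetical names chosen for this example. It assumes a simple heuristic, namely that text whose letters are mostly Han, Hiragana, Katakana, or Hangul should not be passed to an identifier that has no CJK profiles.

```java
import java.lang.Character.UnicodeScript;
import java.util.function.Function;

// Hypothetical helper, not part of Nutch or Tika.
class ScriptGuard {

    /** Returns true when more than half of the letters in the text belong to a CJK script. */
    static boolean isMostlyCjk(String text) {
        int letters = 0;
        int cjk = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace
            }
            letters++;
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HAN
                    || script == UnicodeScript.HIRAGANA
                    || script == UnicodeScript.KATAKANA
                    || script == UnicodeScript.HANGUL) {
                cjk++;
            }
        }
        return letters > 0 && cjk * 2 > letters;
    }

    /**
     * Wraps an arbitrary identifier (assumed to lack CJK support) and
     * returns "unknown" for CJK-dominant text instead of a wrong guess.
     */
    static String identifyOrUnknown(String text, Function<String, String> identifier) {
        return isMostlyCjk(text) ? "unknown" : identifier.apply(text);
    }

    public static void main(String[] args) {
        // Stand-in for an identifier that always answers "it" (Italian).
        Function<String, String> alwaysItalian = t -> "it";
        System.out.println(identifyOrUnknown("這是一個中文句子", alwaysItalian));
        System.out.println(identifyOrUnknown("Questo è un testo italiano", alwaysItalian));
    }
}
```

      A guard like this only avoids the misclassification; it does not identify the CJK language itself, so replacing the deprecated class would still be needed to address both points.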

            People

              Assignee: Unassigned
              Reporter: Yossi Tamari (yossi)
              Votes: 0
              Watchers: 3