Nutch / NUTCH-666

Analysis plugins for multiple language and new Language Identifier Tool

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.1
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: All
    • Patch Info: Patch Available

      Description

      Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, Russian, and Thai. Also includes a new Language Identifier tool that uses the new indexing framework in NUTCH-646.

      1. NUTCH-666-2-20091217-nf.patch
        398 kB
        Dennis Kubes
      2. NUTCH-666-1-20081126.patch
        31 kB
        Dennis Kubes

          Activity

          Lewis John McGibbney added a comment -

          Thank you Dennis for confirming.

          Dennis Kubes added a comment -

          I am still here. I still keep track of the lists even though I haven't been as active with Nutch, because I have been doing more real-time search work with Zoie. I agree with closing this issue; Tika is a better solution nowadays than what we had when this patch was started.

          Lewis John McGibbney added a comment -

          I understand that this is not my issue to close. However, I have not seen Dennis on the lists for a wee while, and in an attempt to clean up dated issues in JIRA I think it is important to progress with closing this issue, as there has been no objection. In addition, it seems only logical that the issue be marked as won't fix and closed, as per the comments provided by various committers.

          If this decision is not welcomed then I will happily reopen :|

          Chris A. Mattmann added a comment -

          Hi Lewis,

          That's fine with me. My comment just meant that I was pushing this issue out and wasn't going to solve it in the last release (i.e., push it to release N+1). I'm fine with marking it as "won't fix" though, for the reasons you mentioned.

          Thanks!

          Lewis John McGibbney added a comment -

          Chris, excuse my naivety, but I am unfamiliar with your phrasing above, so excuse me if you have already dealt with this issue to some degree.

          If this is not the case, can I suggest that we mark this as won't fix? Julien has instigated the transition to language detection via delegation to Apache Tika, as per NUTCH-1075.

          Chris A. Mattmann added a comment -

          Pushing this out per http://bit.ly/c7tBv9

          Julien Nioche added a comment -

          I agree with Sami that this should be contributed to Tika and that we delegate the language identification handling in Nutch to Tika, just as we are doing, or planning to do, for the MimeType and the parsing.

          Dennis Kubes added a comment -

          I don't remember exactly what the difference was, but I do remember that there was a subtle difference in the algorithms that was only noticed after creating the new tools. I think it had something to do with how the ngrams were being handled or that it was taking spaces into account. But try running the identifiers side by side, you will see there is a considerable difference.
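
One way such a difference could arise is sketched below: character n-gram counts built with and without word-boundary padding. This is a minimal illustration only, not the code from either identifier. The function names, the `_` padding character, and the sample text are all assumptions made for the sketch.

```python
from collections import Counter


def char_ngrams(text, n_max=3, pad=True):
    """Count character n-grams of length 1..n_max, word by word.

    With pad=True each word is wrapped in '_' so that n-grams touching a
    word boundary (for example '_th' or 'he_') are counted as well.
    With pad=False only n-grams inside a word are counted.
    """
    counts = Counter()
    for word in text.lower().split():
        token = f"_{word}_" if pad else word
        for n in range(1, n_max + 1):
            for i in range(len(token) - n + 1):
                gram = token[i:i + n]
                if gram.strip("_"):  # skip grams made only of padding
                    counts[gram] += 1
    return counts


def ranked_profile(counts, size=10):
    """Return the `size` most frequent n-grams, ties broken alphabetically."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


sample = "the cat and the hat"  # hypothetical sample text
padded = char_ngrams(sample, pad=True)
unpadded = char_ngrams(sample, pad=False)

print(ranked_profile(padded))
print(ranked_profile(unpadded))
# N-grams that only exist when word boundaries are taken into account.
print(sorted(set(padded) - set(unpadded)))
```

Because a rank-order classifier compares the positions of n-grams in such profiles, two tools that disagree on boundary handling can rank the same text quite differently even when the rest of the algorithm is the same.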

          Andrzej Bialecki added a comment -

          Do you think it was related to the quality of language models that you built (presumably the ones in the patch?) versus the ones in the Nutch plugin, or due to a different classification algorithm? I'm trying to understand the source of such a big difference, because AFAIK the algorithm in textcat is essentially the same as the one we use.

          Dennis Kubes added a comment -

          BTW, the reason we wrote this code, which we worked with an NLP firm to create, rather than using the current Language Identification tool in Nutch, was accuracy. With the current tool we were getting around 70% accuracy, while this new tool routinely came in above 99.5%. We trained on Wikipedia, and most of the errors we saw were English characters in other-language versions of the training data.

          Dennis Kubes added a comment -

          Here is the patch as I last used it, almost a year ago now. I am not sure whether it still works with the current codebase. It uses a hacky version of textcat to create fingerprint files from known-language content. This creates a dictionary, which is configured through the textcat.conf file in the conf directory. The Language Identifier tool is then used to create a database of url -> language code, which was previously included using the CustomFields job of the fields indexer. The other language analysis plugins from the previous patch acted off of the locale or chosen language on the query side, I think.
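
The flow described above (fingerprints from known-language content, a dictionary of fingerprints, then a url -> language code mapping) can be sketched roughly as follows. This is not the patch's code. It assumes a textcat-style rank-order ("out-of-place") comparison, keeps everything in memory instead of using fingerprint files and textcat.conf, and uses made-up training text, URLs, and function names.

```python
from collections import Counter


def profile(text, n_max=3, size=300):
    """Fingerprint: the most frequent character n-grams, in rank order."""
    counts = Counter()
    for word in text.lower().split():
        token = f"_{word}_"
        for n in range(1, n_max + 1):
            for i in range(len(token) - n + 1):
                gram = token[i:i + n]
                if gram.strip("_"):  # skip grams made only of padding
                    counts[gram] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unknown n-grams get the maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def build_dictionary(training):
    """Map language code -> fingerprint built from known-language text."""
    return {lang: profile(text) for lang, text in training.items()}


def identify(text, dictionary):
    """Return the language code whose fingerprint is closest to the text."""
    doc = profile(text)
    return min(dictionary, key=lambda lang: out_of_place(doc, dictionary[lang]))


def build_language_db(pages, dictionary):
    """Map url -> language code for every page."""
    return {url: identify(text, dictionary) for url, text in pages.items()}


# Hypothetical, tiny training data; a real setup would train on far more text.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "nl": "de snelle bruine vos springt over de luie hond en de kat slaapt in het huis",
}
# Hypothetical pages keyed by URL.
PAGES = {
    "http://example.com/a": "the dog and the fox are in the house",
    "http://example.com/b": "de hond en de vos zijn in het huis",
}

dictionary = build_dictionary(TRAINING)
language_db = build_language_db(PAGES, dictionary)
print(language_db)
```

With training text this small the result for unseen pages is not reliable; the sketch only shows how the three stages fit together.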

          Sami Siren added a comment -

          We should also consider switching to Tika for language identification and route the proposed improvements in that area through Tika?

          Andrzej Bialecki added a comment -

          Dennis, what's the status of this patch (especially the missing part, the new language identifier)?

          Raja Santosh Panda added a comment -

          Hi,

          I am looking forward to using only the language identifier (language-identifier.jar) plugin to identify Chinese, Japanese and Korean.

          Can someone help me in this regard?

          Is this already implemented? If yes, how can I take the dev version and use it?

          Can I use the language identifier of version 1.0 and train it (create N-gram profiles) to identify the above three languages?

          Any help is highly appreciated.

          Regards
          Raja

          Otis Gospodnetic added a comment -

          Dennis, could you please describe how this new Lang ID tool is better/different from the previous one?

          Dennis Kubes added a comment -

          It is ok to move to 1.1.

          Doğacan Güney added a comment -

          Dennis, is it OK to move this issue out of 1.0? Or do you want to commit it before?

          Dennis Kubes added a comment -

          Fixed patch. Now includes the changes to AnalyzerFactory to allow multiple languages per plugin.
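
The idea of one plugin serving several languages can be illustrated with a small registry. This is a hypothetical sketch and not the AnalyzerFactory change in the patch: the class name, the comma-separated language list, and the two toy analyzers are all assumptions.

```python
def whitespace_analyzer(text):
    """Toy default analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def bigram_analyzer(text):
    """Toy analyzer for unsegmented scripts: overlapping character bigrams."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


class AnalyzerRegistry:
    """Hypothetical stand-in for a factory mapping language codes to analyzers."""

    def __init__(self, default):
        self._default = default
        self._by_lang = {}

    def register(self, analyzer, langs):
        """Register one analyzer for every code in a comma-separated list.

        The "zh,ja,ko" list format is an assumption made for this sketch.
        """
        for lang in langs.split(","):
            code = lang.strip().lower()
            if code:
                self._by_lang[code] = analyzer

    def get(self, lang):
        """Return the analyzer for `lang`, or the default if none is registered."""
        if not lang:
            return self._default
        return self._by_lang.get(lang.strip().lower(), self._default)


registry = AnalyzerRegistry(default=whitespace_analyzer)
registry.register(bigram_analyzer, "zh, ja, ko")  # one plugin, three languages

print(registry.get("ja")("東京都"))
print(registry.get("fr")("Bonjour le monde"))
```

Falling back to a default analyzer keeps lookups safe for language codes that no plugin has claimed.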

          Dennis Kubes added a comment -

          Part one of patch. This includes the new analyzers for different languages. Part two will include the new language identifier tool.


            People

            • Assignee: Dennis Kubes
            • Reporter: Dennis Kubes
            • Votes: 0
            • Watchers: 2
