Solr / SOLR-2839

add alternative language detection impl

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      based on http://code.google.com/p/language-detection (apache license), supports 53 languages.

      1. SOLR-2839.patch
        94 kB
        Robert Muir

        Activity

        Uwe Schindler added a comment -

        Bulk close after 3.5 is released

        Jan Høydahl added a comment -

        I meant to compare with the situation before the Solr+Lucene merge. It takes a whole lot longer to wait for a dependency to get released before you can consume it, so then it's OK to add it higher up as a first step.

        Robert Muir added a comment -

        It's not really the same in my opinion. Anyone can commit to both Lucene and Solr, so we can always put things in the correct place.

        Jan Høydahl added a comment -

        Sure, it's way better to get stuff done than debate on details. Great work. Stuff can "bubble down" to Tika later, just as stuff has bubbled down from Solr to Lucene.

        Robert Muir added a comment -

        "How does this impl compare with the Tika one for short texts? And wouldn't it make more sense to add this on the Tika level letting the detection method be configurable? Then all Tika users would benefit from it."

        I have no idea, probably not that great? But I didn't compare to Tika.
        Regarding short texts: http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/

        "And wouldn't it make more sense to add this on the Tika level letting the detection method be configurable? Then all Tika users would benefit from it."

        If someone wants to do this, then we can remove this implementation at that time. But for Lucene/Solr, I am able to commit to this project, and I think that it's important for langid to be pluggable to different implementations.

        For example, maybe someone ports Google's detector (http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/) to Java and we expose that too, which might be interesting for short texts.

        Jan Høydahl added a comment -

        Cool. The reasoning behind a list of detected languages was that a more advanced detector could go sentence by sentence and tag multilingual documents correctly. FAST had that capability.

        How does this impl compare with the Tika one for short texts? And wouldn't it make more sense to add this on the Tika level letting the detection method be configurable? Then all Tika users would benefit from it.

        Robert Muir added a comment -

        OK, I'd like to add this basic implementation first.

        Later, we should add support for some advanced parameters and refactoring:

        • Whitelisting should not happen in the base class as a post-filter (though this is fine as a default implementation); subclasses should override it, I think. For this detector, it could improve performance.
        • For this detector, the whitelist should support priors too (e.g. en=0.5, fr=0.1).
        • We should add support for configuring the smoothing parameter and maxTextLength (and the base class's concat should respect that too).
        • Both this implementation and the Tika implementation copy objects across lists of language information, which I think is not very efficient to do per document. So I think we should change the API from List<DetectedLanguage> detectLanguage() to Iterable<DetectedLanguage> detectLanguage(). In general, the caller seems to want just the first one anyway.
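
        The whitelist, prior, and Iterable points above could look roughly like the following. This is only a sketch under stated assumptions, not code from the patch: LangIdSketch, Detected, parsePriors, and the rawScores map (standing in for whatever per-language scores a real detector produces) are all hypothetical names. It still builds a list internally; the point is only that callers see an Iterable and that non-whitelisted languages are never scored.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the patch or from language-detection. */
final class LangIdSketch {

  /** A language code with a certainty score (a stand-in for DetectedLanguage). */
  public static final class Detected {
    public final String lang;
    public final double certainty;

    public Detected(String lang, double certainty) {
      this.lang = lang;
      this.certainty = certainty;
    }
  }

  /** Parses a whitelist such as "en=0.5,fr=0.1" into language -> prior weight. */
  public static Map<String, Double> parsePriors(String spec) {
    Map<String, Double> priors = new LinkedHashMap<>();
    for (String entry : spec.split(",")) {
      String[] kv = entry.trim().split("=");
      // Assumption: a bare code such as "de" means "whitelisted, weight 1.0".
      priors.put(kv[0].trim(), kv.length > 1 ? Double.parseDouble(kv[1].trim()) : 1.0);
    }
    return priors;
  }

  /**
   * Applies the whitelist inside the detector instead of as a post-filter:
   * only whitelisted languages are considered, each raw score is multiplied
   * by its prior, and the results are renormalized and returned best-first.
   */
  public static Iterable<Detected> detect(Map<String, Double> rawScores,
                                          Map<String, Double> priors) {
    Map<String, Double> weighted = new LinkedHashMap<>();
    double total = 0.0;
    for (Map.Entry<String, Double> p : priors.entrySet()) {
      Double raw = rawScores.get(p.getKey());
      if (raw == null) {
        continue; // the detector produced no score for this whitelisted language
      }
      double w = raw * p.getValue();
      weighted.put(p.getKey(), w);
      total += w;
    }
    List<Detected> out = new ArrayList<>();
    for (Map.Entry<String, Double> e : weighted.entrySet()) {
      out.add(new Detected(e.getKey(), total > 0 ? e.getValue() / total : 0.0));
    }
    out.sort(Comparator.comparingDouble((Detected d) -> d.certainty).reversed());
    return out;
  }
}
```

        A caller that only wants the best guess would take the first element of the returned Iterable. With the priors en=0.5, fr=0.1 and equal raw scores for en and fr, en ranks first, and a language absent from the whitelist (say de) never appears in the result.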
        Koji Sekiguchi added a comment -

        "based on http://code.google.com/p/language-detection (apache license), supports 53 languages."

        I've seen that, too. +1.

        Robert Muir added a comment -

        This is just for reviewing; there are a lot of svn moves etc. (so I doubt you can easily apply it).

        Robert Muir added a comment -

        Patch; needs the language-detection jar and its deps from revision 111 of language-detection (in the lib folder), and the profile files (in the resources folder).


          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir