Lucene - Core
LUCENE-3888

split off the spell check word and surface form in spell check dictionary

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: modules/spellchecker
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The "did you mean?" feature built on Lucene's spell checker unfortunately does not work well for Japanese, and this is a longstanding problem: the logic needs comparatively long text to check spelling, but in some languages (e.g. Japanese) most words are too short for the spell checker to be useful.

      I think that, at least for Japanese, things can be improved if we split off the spell check word and the surface form in the spell check dictionary. We could then use ReadingAttribute for spell checking but CharTermAttribute for suggesting, for example.
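To illustrate the motivation, here is a standalone sketch (the example words and their romanized readings are my own illustrations, not taken from this issue): two distinct two-kanji words can be a single edit apart on the surface form, so edit distance over surfaces carries almost no signal, while their readings are long enough to compare meaningfully.

```java
// Sketch: why comparing readings helps for short Japanese surface forms.
// The words and readings below are illustrative examples only.
public class ReadingDistanceDemo {
    // Plain Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Two different two-kanji words: 1 edit out of 2 chars tells us almost nothing.
        System.out.println(levenshtein("検索", "検査"));      // 1 (of length 2)
        // Their readings (romanized here) carry much more signal per edit.
        System.out.println(levenshtein("kensaku", "kensa")); // 2 (of length 7)
    }
}
```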

      1. LUCENE-3888.patch
        4 kB
        Koji Sekiguchi
      2. LUCENE-3888.patch
        6 kB
        Robert Muir
      3. LUCENE-3888.patch
        6 kB
        Robert Muir
      4. LUCENE-3888.patch
        14 kB
        Koji Sekiguchi
      5. LUCENE-3888.patch
        16 kB
        Robert Muir
      6. LUCENE-3888.patch
        26 kB
        Robert Muir

        Activity

        Uwe Schindler added a comment -

        Move issue to Lucene 4.9.

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Robert Muir added a comment -

        rmuir20120906-bulk-40-change

        Hoss Man added a comment -

        Bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. All bulk edited issues have hoss20120711-bulk-40-change in a comment.

        Robert Muir added a comment -

        Thanks for the feedback Koji.

        I'm not happy with the situation: I thought it would be easy to support
        some rough Japanese spellcheck in 3.6.

        But it seems like we need to do a lot of cleanup to make it work.
        I would rather fix all of these APIs and do it right the first time, so
        that things like distributed support work too.

        Koji Sekiguchi added a comment -

        Thanks Robert for the patches and comments.

        The only option for 3.6 would be something like my previous patch
        (https://issues.apache.org/jira/secure/attachment/12519860/LUCENE-3888.patch) which
        has the disadvantages of doing the second-phase re-ranking on surface forms.

        With those disadvantages, the spell checker won't work well for Japanese anyway, so I'm giving up on this for 3.6.

        Robert Muir added a comment -

        In my opinion we should set this as fix for 4.0

        The only option for 3.6 would be something like my previous patch
        (https://issues.apache.org/jira/secure/attachment/12519860/LUCENE-3888.patch) which
        has the disadvantages of doing the second-phase re-ranking on surface forms.

        Any other opinions?

        Robert Muir added a comment -

        updated patch (note with this one: Solr does not yet compile).

        I went the route of trying to clean up these APIs correctly: I think there are serious problems here.

        The biggest violation is stuff like:

        // convert to array string:
        // nocommit: why don't we just return SuggestWord[] with all the information?
        // consumers such as Solr must be recomputing this stuff again?!
        String[] list = new String[sugQueue.size()];
        for (int i = sugQueue.size() - 1; i >= 0; i--) {
          list[i] = sugQueue.pop().getSurface();
        }

        return list;
        

        DirectSpellChecker already returns all this data; I think it's doing the right thing, but SpellChecker should be fixed. Even in the normal case, surely we are recomputing docFreq etc. on all the candidates, which is wasteful.

        I'll keep plugging away but it seems like this will be a pretty serious refactoring (including e.g. distributed spellcheck refactoring) and difficult for 3.6.
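As a standalone illustration of the direction the nocommit comment points at, the sketch below keeps score and docFreq on each suggestion so consumers such as Solr would not need to recompute them. SuggestWord here is a hypothetical stand-in written for this example, not Lucene's actual SuggestWord class, and drain() is an invented helper.

```java
import java.util.PriorityQueue;

// Sketch: return structured suggestions instead of flattening to String[].
public class SuggestWordDemo {
    static class SuggestWord {
        final String surface;  // form shown to the user
        final float score;     // similarity score from the re-rank phase
        final int freq;        // docFreq, so consumers need not recompute it
        SuggestWord(String surface, float score, int freq) {
            this.surface = surface; this.score = score; this.freq = freq;
        }
    }

    // Drain a best-first queue into an array, best suggestion first,
    // keeping score and freq instead of discarding them.
    static SuggestWord[] drain(PriorityQueue<SuggestWord> queue) {
        SuggestWord[] out = new SuggestWord[queue.size()];
        for (int i = 0; i < out.length; i++) out[i] = queue.poll();
        return out;
    }

    public static void main(String[] args) {
        PriorityQueue<SuggestWord> q =
            new PriorityQueue<>((a, b) -> Float.compare(b.score, a.score));
        q.add(new SuggestWord("lucene", 0.9f, 42));
        q.add(new SuggestWord("lucent", 0.7f, 7));
        SuggestWord[] result = drain(q);
        System.out.println(result[0].surface + " " + result[0].freq); // lucene 42
    }
}
```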

        Robert Muir added a comment -

        I updated the patch and fixed Koji's test; it's passing, BUT there is a nocommit:

        // nocommit: we need to fix SuggestWord to separate surface and analyzed forms.
        // currently the 're-rank' is based on the surface forms!
        spellChecker.setAccuracy(0F);
        

        To explain with the Japanese case how the patch currently works, the spellchecker has two phases:

        • Phase 1: n-gram approximation phase. Here we generate a n-gram boolean query on the Readings. This is working fine.
        • Phase 2: re-rank phase. Here we take the candidates from Phase 1 and do a real comparison (e.g. Levenshtein) to give them the final score. The problem is this currently uses surface form!
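The two phases above can be sketched in a self-contained way with a toy in-memory dictionary (the readings, surface forms, and helper names below are illustrative only; the real code works against the spellcheck index):

```java
import java.util.*;

// Toy sketch of the two-phase spellcheck, operating on readings.
public class TwoPhaseSpellcheckDemo {
    // Phase 1 helper: character bigrams for cheap candidate generation.
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    // Phase 2 helper: real comparison (Levenshtein edit distance).
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // dict maps reading -> surface form. Candidates come from reading n-gram
    // overlap (phase 1); the re-rank (phase 2) uses the reading, not the
    // surface, and only the surface forms are returned to the user.
    static List<String> suggest(String queryReading, Map<String, String> dict) {
        Set<String> qGrams = bigrams(queryReading);
        List<String> candidates = new ArrayList<>();
        for (String reading : dict.keySet()) {
            Set<String> overlap = new HashSet<>(bigrams(reading));
            overlap.retainAll(qGrams);
            if (!overlap.isEmpty()) candidates.add(reading);        // phase 1
        }
        candidates.sort(Comparator.comparingInt(
            (String r) -> levenshtein(queryReading, r)));           // phase 2
        List<String> out = new ArrayList<>();
        for (String r : candidates) out.add(dict.get(r));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> dict = new LinkedHashMap<>();
        dict.put("kensaku", "検索");
        dict.put("kensa", "検査");
        System.out.println(suggest("kensakv", dict)); // [検索, 検査]
    }
}
```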

        I think phase 2 should re-rank based on the 'analyzed form' too. Inside the spellchecker itself I don't think this is very difficult: when analyzed != surface, we just store the surface form for later retrieval.

        The problem is that the spellcheck comparison APIs such as SuggestWord don't even have any getters or setters, and present no way for me to migrate to surface+analyzed in a backwards-compatible way...

        I'll think about this in the meantime. Maybe we should just break and clean up these APIs, since it's a contrib module and they are funky?

        Robert Muir added a comment -

        lemme see if I can help with the test. I feel bad I didn't supply one with the prototype patch.

        About the Solr integration: this looks good! We can use a similar approach for autosuggest, too,
        so this could configure the analyzer for LUCENE-3842.

        I wonder if we should allow separate configuration of "index" and "query" analyzers? I know
        I came up with some use-cases for that for autosuggest, but I'm not sure about spellchecking.
        I guess it wouldn't be overkill to allow it though.

        Koji Sekiguchi added a comment -

        The test itself is not good.

        Christian Moen added a comment -

        This is excellent, Koji and Robert. We should be able to do basic spellchecking for Japanese with this.

        Koji Sekiguchi added a comment -

        I added a test for the surface analyzer. I also added code for the analyzer in Solr.

        Currently, due to a classpath problem, the test cannot be compiled. I should dig in, but if someone could take a look, it would be appreciated.

        Robert Muir added a comment -

        fix the obvious reset() problem... the real problem is I need to reset() my coffee mug.

        Robert Muir added a comment -

        Here is a simple prototype of what I was suggesting: it allows you to specify an Analyzer for SpellChecker.

        This Analyzer converts the 'surface form' into 'analyzed form' at index and query time: at index-time it forms n-grams based on the analyzed form, but stores the surface form for retrieval.

        At query-time we have a similar process: the docFreq() etc checks are done on the surface form, but the actual spellchecking on the analyzed form.

        The default Analyzer is null which means do nothing, and the patch has no tests, refactoring, or any of that.

        Robert Muir added a comment -

        Koji: hmm I think the problem is not in the Dictionary interface (which is actually ok),
        but instead in the spellcheckers and suggesters themselves?

        For spellchecking, I think we need to expose more Analysis options in Spellchecker:
        currently this is actually hardcoded at KeywordAnalyzer (it uses NOT_ANALYZED).
        Instead I think you should be able to pass Analyzer: we would also
        have a TokenFilter for Japanese that replaces term text with Reading from ReadingAttribute.

        In the same way, suggest can analyze too. (LUCENE-3842 is already some work for that, especially
        with the idea to support Japanese this exact same way).

        So in short I think we should:

        1. create a TokenFilter (similar to BaseFormFilter) which copies ReadingAttribute into termAtt.
        2. refactor the 'n-gram analysis' in spellchecker to work on actual tokenstreams (this can
          also likely be implemented as tokenstreams), allowing user to set an Analyzer on Spellchecker
          to control how it analyzes text.
        3. continue to work on 'analysis for suggest' like LUCENE-3842.
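Step 1 of the plan above can be sketched without Lucene dependencies roughly like this. Token and readingFilter are hypothetical stand-ins invented for this example, modeling what a TokenFilter copying ReadingAttribute into CharTermAttribute would do; the reading shown is illustrative.

```java
import java.util.*;
import java.util.stream.*;

// Pure-Java sketch of a "filter" that replaces each token's term text with
// its reading, the way a ReadingAttribute-copying TokenFilter would.
public class ReadingFilterDemo {
    static class Token {
        String term;          // what downstream consumers (n-gramming, etc.) see
        final String reading; // from morphological analysis, may be null
        Token(String term, String reading) { this.term = term; this.reading = reading; }
    }

    // If a reading is available, overwrite the term with it; otherwise pass through.
    static List<Token> readingFilter(List<Token> in) {
        for (Token t : in) {
            if (t.reading != null) t.term = t.reading;
        }
        return in;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("検索", "kensaku")); // illustrative reading
        tokens.add(new Token("lucene", null));    // no reading: unchanged
        readingFilter(tokens);
        System.out.println(tokens.stream().map(t -> t.term)
                                 .collect(Collectors.joining(" ")));
        // prints: kensaku lucene
    }
}
```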

        Note this use of analyzers in spellcheck/suggest is unrelated to Solr's current use of 'analyzers'
        which is only for some query manipulation and not very useful.

        Koji Sekiguchi added a comment -

        The patch cannot be compiled right now because I changed the return type of the method in the Dictionary interface, but the implementing classes have not yet been updated.

        Please give some comments, since I'm new to the spell checker. If there is no problem with this direction, I'll continue the work.


          People

          • Assignee:
            Koji Sekiguchi
          • Reporter:
            Koji Sekiguchi
          • Votes:
            2
          • Watchers:
            2
