Lucene - Core
  1. Lucene - Core
  2. LUCENE-2995

factor out a shared spellchecking module

    Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      In lucene's contrib we have spellchecking support (index-based spellchecker, directspellchecker, etc).
      we also have some things like pluggable comparators.

      In solr we have auto-suggest support (with two implementations it looks like), some good utilities like HighFrequencyDictionary, etc.

      I think spellchecking is really important... google has upped the ante to what users expect.
      So I propose we combine all this stuff into a shared modules/spellchecker, which will make it easier
      to refactor and improve the quality.

      1. LUCENE-2995-diff.patch
        483 kB
        Robert Muir
      2. LUCENE-2995.patch
        71 kB
        Robert Muir
      3. LUCENE-2995.patch
        54 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Mark Miller added a comment -

          Or rather than revert... just iterate it forward until it satisfies
          everybody's concerns. Why keep moving backwards?

          +1. Reverting itself should not feel wrong - asking others to revert should IMO. The default mode should be moving forward, not backward. Everything is open to change and improvement - asking for a revert from someone else will always be considered a very big deal by me. Unless it's a special circumstance I'm going to rant and rave about it every time myself.

          Show
          Mark Miller added a comment - Or rather than revert... just iterate it forward until it satisfies everybody's concerns. Why keep moving backwards? +1. Reverting itself should not feel wrong - asking others to revert should IMO. The default mode should be moving forward, not backward. Everything is open to change and improvement - asking for a revert from someone else will always be considered a very big deal by me. Unless it's a special circumstance I'm going to rant and rave about it every time myself.
          Hide
          Robert Muir added a comment -

          Committed revision 1126642.

          I'd like to expose this to 3.x users as well, but this is more complicated in this case so I'll open some followup issues for this.

          Show
          Robert Muir added a comment - Committed revision 1126642. I'd like to expose this to 3.x users as well, but this is more complicated in this case so I'll open some followup issues for this.
          Hide
          Robert Muir added a comment -

          I agree, I would much rather iterate forward if anyone has concerns.

          I will mark this issue resolved, we can open separate issues for any problems.
          I know of a few followup issues I will open myself.

          Show
          Robert Muir added a comment - I agree, I would much rather iterate forward if anyone has concerns. I will mark this issue resolved, we can open separate issues for any problems. I know of a few followup issues I will open myself.
          Hide
          Greg Stein added a comment -

          Or rather than revert... just iterate it forward until it satisfies
          everybody's concerns. Why keep moving backwards?

          Show
          Greg Stein added a comment - Or rather than revert... just iterate it forward until it satisfies everybody's concerns. Why keep moving backwards?
          Hide
          Simon Willnauer added a comment -

          I'd like to commit this one before this happens, and if anyone has concerns or objections I'll just revert and we can revisit.

          looks good after a quick scan... I think you should commit and we iterate on it once its in even if we need to revert. Reverting should not feel wrong so here is my +1 to commit

          Show
          Simon Willnauer added a comment - I'd like to commit this one before this happens, and if anyone has concerns or objections I'll just revert and we can revisit. looks good after a quick scan... I think you should commit and we iterate on it once its in even if we need to revert. Reverting should not feel wrong so here is my +1 to commit
          Hide
          Robert Muir added a comment -

          these patches go completely out of date fast... (I had to redo the previous patch from scratch basically).

          I'd like to commit this one before this happens, and if anyone has concerns or objections I'll just revert and we can revisit.

          Show
          Robert Muir added a comment - these patches go completely out of date fast... (I had to redo the previous patch from scratch basically). I'd like to commit this one before this happens, and if anyone has concerns or objections I'll just revert and we can revisit.
          Hide
          Robert Muir added a comment -

          here is the previous patch as an ordinary diff, maybe easier to apply.

          Show
          Robert Muir added a comment - here is the previous patch as an ordinary diff, maybe easier to apply.
          Hide
          Robert Muir added a comment -

          Updated patch:

          • Supports Dawid's new FST autosuggester
          • Ported the low-level tests over to the module
          • Backwards compatibility with older Solr config files

          before applying the patch, please use the following script:

          svn move lucene/contrib/spellchecker modules/suggest
          svn move solr/src/java/org/apache/solr/util/HighFrequencyDictionary.java modules/suggest/src/java/org/apache/lucene/search/spell
          svn move solr/src/java/org/apache/solr/util/TermFreqIterator.java modules/suggest/src/java/org/apache/lucene/search/spell
          svn move solr/src/java/org/apache/solr/util/SortedIterator.java modules/suggest/src/java/org/apache/lucene/search/spell
          svn move solr/src/java/org/apache/solr/spelling/suggest/Suggester.java solr/src/java/org/apache/solr/spelling
          svn move solr/src/java/org/apache/solr/spelling/suggest modules/suggest/src/java/org/apache/lucene/search/spell
          svn move modules/suggest/src/java/org/apache/lucene/search/spell/suggest modules/suggest/src/java/org/apache/lucene/search
          svn revert solr/src/java/org/apache/solr/spelling/suggest
          svn mkdir --parents modules/suggest/src/test/org/apache/lucene/search/suggest/
          svn move solr/src/java/org/apache/solr/spelling/Suggester.java solr/src/java/org/apache/solr/spelling/suggest
          svn revert solr/src/java/org/apache/solr/spelling/suggest/fst
          svn revert solr/src/java/org/apache/solr/spelling/suggest/jaspell
          svn revert solr/src/java/org/apache/solr/spelling/suggest/tst
          svn mkdir --parents modules/suggest/src/test/org/apache/lucene/search/suggest/
          svn move solr/src/test/org/apache/solr/spelling/suggest/Average.java modules/suggest/src/test/org/apache/lucene/search/suggest/
          svn move solr/src/test/org/apache/solr/spelling/suggest/LookupBenchmarkTest.java modules/suggest/src/test/org/apache/lucene/search/suggest/
          svn move solr/src/test/org/apache/solr/spelling/suggest/PersistenceTest.java modules/suggest/src/test/org/apache/lucene/search/suggest/
          svn move solr/src/test/org/apache/solr/spelling/suggest/TermFreq.java modules/suggest/src/test/org/apache/lucene/search/suggest/
          svn move solr/src/test/org/apache/solr/spelling/suggest/TermFreqArrayIterator.java modules/suggest/src/test/org/apache/lucene/search/suggest/
          svn move solr/src/test/org/apache/solr/spelling/suggest/fst modules/suggest/src/test/org/apache/lucene/search/suggest/fst
          svn move solr/src/test-files/Top50KWiki.utf8 modules/suggest/src/test/org/apache/lucene/search/suggest
          
          Show
          Robert Muir added a comment - Updated patch: Supports Dawid's new FST autosuggester Ported the low-level tests over to the module Backwards compatibility with older Solr config files before applying the patch, please use the following script: svn move lucene/contrib/spellchecker modules/suggest svn move solr/src/java/org/apache/solr/util/HighFrequencyDictionary.java modules/suggest/src/java/org/apache/lucene/search/spell svn move solr/src/java/org/apache/solr/util/TermFreqIterator.java modules/suggest/src/java/org/apache/lucene/search/spell svn move solr/src/java/org/apache/solr/util/SortedIterator.java modules/suggest/src/java/org/apache/lucene/search/spell svn move solr/src/java/org/apache/solr/spelling/suggest/Suggester.java solr/src/java/org/apache/solr/spelling svn move solr/src/java/org/apache/solr/spelling/suggest modules/suggest/src/java/org/apache/lucene/search/spell svn move modules/suggest/src/java/org/apache/lucene/search/spell/suggest modules/suggest/src/java/org/apache/lucene/search svn revert solr/src/java/org/apache/solr/spelling/suggest svn mkdir --parents modules/suggest/src/test/org/apache/lucene/search/suggest/ svn move solr/src/java/org/apache/solr/spelling/Suggester.java solr/src/java/org/apache/solr/spelling/suggest svn revert solr/src/java/org/apache/solr/spelling/suggest/fst svn revert solr/src/java/org/apache/solr/spelling/suggest/jaspell svn revert solr/src/java/org/apache/solr/spelling/suggest/tst svn mkdir --parents modules/suggest/src/test/org/apache/lucene/search/suggest/ svn move solr/src/test/org/apache/solr/spelling/suggest/Average.java modules/suggest/src/test/org/apache/lucene/search/suggest/ svn move solr/src/test/org/apache/solr/spelling/suggest/LookupBenchmarkTest.java modules/suggest/src/test/org/apache/lucene/search/suggest/ svn move solr/src/test/org/apache/solr/spelling/suggest/PersistenceTest.java modules/suggest/src/test/org/apache/lucene/search/suggest/ svn move solr/src/test/org/apache/solr/spelling/suggest/TermFreq.java modules/suggest/src/test/org/apache/lucene/search/suggest/ svn move solr/src/test/org/apache/solr/spelling/suggest/TermFreqArrayIterator.java modules/suggest/src/test/org/apache/lucene/search/suggest/ svn move solr/src/test/org/apache/solr/spelling/suggest/fst modules/suggest/src/test/org/apache/lucene/search/suggest/fst svn move solr/src/test-files/Top50KWiki.utf8 modules/suggest/src/test/org/apache/lucene/search/suggest
          Hide
          Yonik Seeley added a comment -

          I understand Greg's objection - given that the projects are currently one, what I said was not a technical argument against the patch.
          If Lucene and Solr do end up splitting, Solr could always copy (or start at an earlier revision) any files that it needed more control over anyway.

          My -1 is rescinded.

          Show
          Yonik Seeley added a comment - I understand Greg's objection - given that the projects are currently one, what I said was not a technical argument against the patch. If Lucene and Solr do end up splitting, Solr could always copy (or start at an earlier revision) any files that it needed more control over anyway. My -1 is rescinded.
          Hide
          Steve Rowe added a comment -

          Yonik: that is an improper veto, and does not stand.

          Greg, could you please be more specific? Why is Yonik's veto improper?

          Show
          Steve Rowe added a comment - Yonik: that is an improper veto, and does not stand. Greg, could you please be more specific? Why is Yonik's veto improper?
          Hide
          Greg Stein added a comment -

          Yonik: that is an improper veto, and does not stand.

          Show
          Greg Stein added a comment - Yonik: that is an improper veto, and does not stand.
          Hide
          Yonik Seeley added a comment -

          -1 to this for now.
          The interpretation of what it means to be merged has taken a turn for the worse, with solr features being blocked (see SOLR-2272).
          I fear we may need to split Solr into it's own TLP, and hence we should be doing any further refactoring at this time since if Solr ends up as it's own project, it will want these files to remain under it's control.

          Show
          Yonik Seeley added a comment - -1 to this for now. The interpretation of what it means to be merged has taken a turn for the worse, with solr features being blocked (see SOLR-2272 ). I fear we may need to split Solr into it's own TLP, and hence we should be doing any further refactoring at this time since if Solr ends up as it's own project, it will want these files to remain under it's control.
          Hide
          Dawid Weiss added a comment -

          The patch probably won't apply after I added that automaton-based suggester to Solr, but I didn't have a chance to peek at it yet.

          Show
          Dawid Weiss added a comment - The patch probably won't apply after I added that automaton-based suggester to Solr, but I didn't have a chance to peek at it yet.
          Hide
          Michael McCandless added a comment -

          Is this ready to go in....?

          Show
          Michael McCandless added a comment - Is this ready to go in....?
          Hide
          Grant Ingersoll added a comment -

          if that's in the scope for what you guys have in mind for this module, go ahead.

          It's in the back of my head. I've got Mahout collab. filtering hooked up through Solr already and it would be dead simple to bring in here, too, but it would fit nicely in this framework. For instance, given a set of search results, it can go do Item-Item recommendations based on doc-ids.

          suggest

          +1. Simple, to the point and has room to grow.

          Show
          Grant Ingersoll added a comment - if that's in the scope for what you guys have in mind for this module, go ahead. It's in the back of my head. I've got Mahout collab. filtering hooked up through Solr already and it would be dead simple to bring in here, too, but it would fit nicely in this framework. For instance, given a set of search results, it can go do Item-Item recommendations based on doc-ids. suggest +1. Simple, to the point and has room to grow.
          Hide
          Michael McCandless added a comment -

          How about just "suggest"?

          I don't like suggester because I'm really not sure if it's spelled suggestor and I feel whatever name we choose here damned well better be easy to spell!!

          Show
          Michael McCandless added a comment - How about just "suggest"? I don't like suggester because I'm really not sure if it's spelled suggestor and I feel whatever name we choose here damned well better be easy to spell!!
          Hide
          Steve Rowe added a comment -

          Or maybe "did-you-mean" or "instead"?

          Show
          Steve Rowe added a comment - Or maybe "did-you-mean" or "instead"?
          Hide
          Steve Rowe added a comment -

          maybe "query-suggester" or "term-suggester" ?

          Or maybe "reword" or "rephrase"?

          Show
          Steve Rowe added a comment - maybe "query-suggester" or "term-suggester" ? Or maybe "reword" or "rephrase"?
          Hide
          Hoss Man added a comment -

          i don't really have any better suggestions (ack, i hate puns unintentional or otherwise) on the name, but the one word of caution i would put out there is that "suggestions" and "suggester" are vague about what they "suggest"

          people might confuse this with "morelikethis" or "recommendation engine" type stuff.

          if that's in the scope for what you guys have in mind for this module, go ahead.

          if not .... maybe "query-suggester" or "term-suggester" ?

          Show
          Hoss Man added a comment - i don't really have any better suggestions (ack, i hate puns unintentional or otherwise) on the name, but the one word of caution i would put out there is that "suggestions" and "suggester" are vague about what they "suggest" people might confuse this with "morelikethis" or "recommendation engine" type stuff. if that's in the scope for what you guys have in mind for this module, go ahead. if not .... maybe "query-suggester" or "term-suggester" ?
          Hide
          Grant Ingersoll added a comment -

          yeah, I like suggestions or suggester

          Show
          Grant Ingersoll added a comment - yeah, I like suggestions or suggester
          Hide
          Andrzej Bialecki added a comment -

          +1 for a "suggester" module.

          Show
          Andrzej Bialecki added a comment - +1 for a "suggester" module.
          Hide
          Robert Muir added a comment -

          Grant should we call the module suggestor or suggestions or something instead of spellcheck?

          I didnt spend a lot of time thinking about the name in the patch, but thats a good point.

          Show
          Robert Muir added a comment - Grant should we call the module suggestor or suggestions or something instead of spellcheck? I didnt spend a lot of time thinking about the name in the patch, but thats a good point.
          Hide
          Grant Ingersoll added a comment -

          See also SOLR-2080. Spell checking, suggestions and related searches are all part of what I would call a Suggester framework or a Discovery framework. Doesn't need to be done here, but I do think it's easy to have a common API for all of these "suggestions", especially if we can factor in user feedback into them, as right now, we only solve 1/2 of the problem.

          Show
          Grant Ingersoll added a comment - See also SOLR-2080 . Spell checking, suggestions and related searches are all part of what I would call a Suggester framework or a Discovery framework. Doesn't need to be done here, but I do think it's easy to have a common API for all of these "suggestions", especially if we can factor in user feedback into them, as right now, we only solve 1/2 of the problem.
          Hide
          Robert Muir added a comment -

          well there's no rush on this issue, i just wanted to throw out the idea...

          I think it definitely makes sense if you are going to be implementing cool FST stuff for suggesting etc!

          i saw a lot of opportunities for refactoring just by combining the code in one place.

          Show
          Robert Muir added a comment - well there's no rush on this issue, i just wanted to throw out the idea... I think it definitely makes sense if you are going to be implementing cool FST stuff for suggesting etc! i saw a lot of opportunities for refactoring just by combining the code in one place.
          Hide
          Dawid Weiss added a comment -

          Ok, great. I'll look into this this week.

          Show
          Dawid Weiss added a comment - Ok, great. I'll look into this this week.
          Hide
          Michael McCandless added a comment -

          This is awesome Robert! Spell checking is a hugely important feature...

          Show
          Michael McCandless added a comment - This is awesome Robert! Spell checking is a hugely important feature...
          Hide
          Robert Muir added a comment -

          I think so? This issue just moves the code around, e.g. your FST-based Lookup
          impl would ultimately sit under modules/spellchecker instead of Solr, and lucene
          users would be able to use it, too.

          Show
          Robert Muir added a comment - I think so? This issue just moves the code around, e.g. your FST-based Lookup impl would ultimately sit under modules/spellchecker instead of Solr, and lucene users would be able to use it, too.
          Hide
          Dawid Weiss added a comment -

          This is a related issue in SOLR that I plan to work on this week, Robert. Is it o.k.?

          Show
          Dawid Weiss added a comment - This is a related issue in SOLR that I plan to work on this week, Robert. Is it o.k.?
          Hide
          Simon Willnauer added a comment -

          Robert, thanks for starting this and pushing up a patch. I think its a great start and we should try to let lots of other stuff follow ASAP.

          Really any serious 'refactoring' e.g. perf improvements should be on followup issues I think.

          +1

          Show
          Simon Willnauer added a comment - Robert, thanks for starting this and pushing up a patch. I think its a great start and we should try to let lots of other stuff follow ASAP. Really any serious 'refactoring' e.g. perf improvements should be on followup issues I think. +1
          Hide
          Chris Male added a comment -

          +1

          Show
          Chris Male added a comment - +1
          Hide
          Robert Muir added a comment -

          Just a quick shot at this (all tests pass).

          Really any serious 'refactoring' e.g. perf improvements should be on followup issues I think.

          before applying the patch, run this:

          svn move lucene/contrib/spellchecker modules
          svn move solr/src/java/org/apache/solr/util/HighFrequencyDictionary.java modules/spellchecker/src/java/org/apache/lucene/search/spell
          svn move solr/src/java/org/apache/solr/util/TermFreqIterator.java modules/spellchecker/src/java/org/apache/lucene/search/spell
          svn move solr/src/java/org/apache/solr/util/SortedIterator.java modules/spellchecker/src/java/org/apache/lucene/search/spell
          svn move solr/src/java/org/apache/solr/spelling/suggest/Suggester.java solr/src/java/org/apache/solr/spelling
          svn move solr/src/java/org/apache/solr/spelling/suggest modules/spellchecker/src/java/org/apache/lucene/search/spell
          
          Show
          Robert Muir added a comment - Just a quick shot at this (all tests pass). Really any serious 'refactoring' e.g. perf improvements should be on followup issues I think. before applying the patch, run this: svn move lucene/contrib/spellchecker modules svn move solr/src/java/org/apache/solr/util/HighFrequencyDictionary.java modules/spellchecker/src/java/org/apache/lucene/search/spell svn move solr/src/java/org/apache/solr/util/TermFreqIterator.java modules/spellchecker/src/java/org/apache/lucene/search/spell svn move solr/src/java/org/apache/solr/util/SortedIterator.java modules/spellchecker/src/java/org/apache/lucene/search/spell svn move solr/src/java/org/apache/solr/spelling/suggest/Suggester.java solr/src/java/org/apache/solr/spelling svn move solr/src/java/org/apache/solr/spelling/suggest modules/spellchecker/src/java/org/apache/lucene/search/spell

            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development