Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5, 3.1, 4.0-ALPHA
    • Component/s: spellchecker
    • Labels:
      None

      Description

      Enhance the SpellCheckComponent to run in a distributed (sharded) environment.

      1. SOLR-785.patch
        13 kB
        Shalin Shekhar Mangar
      2. SOLR-785.patch
        46 kB
        Shalin Shekhar Mangar
      3. SOLR-785.patch
        44 kB
        Shalin Shekhar Mangar
      4. SOLR-785.patch
        44 kB
        Shalin Shekhar Mangar
      5. spelling-shard.patch
        5 kB
        Matthew Woytowitz

        Issue Links

          Activity

          Hide
          Matthew Woytowitz added a comment -

          This is a patch to enable spellchecking in the distributed setup.

          Show
          Matthew Woytowitz added a comment - This is a patch to enable spellchecking in the distributed setup.
          Hide
          Shalin Shekhar Mangar added a comment -

          Matthew, thanks for the patch. Can you please include a unit test?

          Also, I'm thinking that we could refactor the Lucene spell checker to fetch the suggestions without the edit distance and find out the top 'n' suggestions after performing an edit distance on the aggregator. What do you think?

          Show
          Shalin Shekhar Mangar added a comment - Matthew, thanks for the patch. Can you please include a unit test? Also, I'm thinking that we could refactor the Lucene spell checker to fetch the suggestions without the edit distance and find out the top 'n' suggestions after performing an edit distance on the aggregator. What do you think?
          Hide
          Shalin Shekhar Mangar added a comment -

          I'm writing a test for this patch.

          With only one shard (say localhost:57369/solr), I get correct results if I do not specify shards parameter. If I specify shards=localhost:57369/solr on the same Solr, I get a different result. Debugging now.

          Show
          Shalin Shekhar Mangar added a comment - I'm writing a test for this patch. With only one shard (say localhost:57369/solr), I get correct results if I do not specify shards parameter. If I specify shards=localhost:57369/solr on the same Solr, I get a different result. Debugging now.
          Hide
          Shalin Shekhar Mangar added a comment -

          Well, there is only finishStage in the patch which is of no use alone. SpellCheckComponent needs to either generate shard requests or piggyback on existing requests to merge the data in finishStage.

          This brings up interesting questions - How do we perform distributed search for SearchComponents which are not added by default to SearchHandler? Only if the chain has QueryComponent, can we modify its requests. Otherwise we must issue new ShardRequests. Is assuming that configuration of the search component chain is the same between all shards OK?

          Show
          Shalin Shekhar Mangar added a comment - Well, there is only finishStage in the patch which is of no use alone. SpellCheckComponent needs to either generate shard requests or piggyback on existing requests to merge the data in finishStage. This brings up interesting questions - How do we perform distributed search for SearchComponents which are not added by default to SearchHandler? Only if the chain has QueryComponent, can we modify its requests. Otherwise we must issue new ShardRequests. Is assuming that configuration of the search component chain is the same between all shards OK?
          Hide
          Shalin Shekhar Mangar added a comment -

          As noted on solr-dev, I had misunderstood the behavior. The problem was that I had configured a separate handler for spellcheck and I needed to pass shards.qt to make it work with distributed search.

          So distributed spell check is working and I'm trying to write tests making sure that the response returned by distributed spellcheck is the same as the one returned by non-distributed spellcheck for the various spellcheck parameters.

          In the current patch, distributed spellcheck always returns correctlySpelled=false which non-distributed mode does not. I'll have a patch ready soon with the test and fixes.

          Show
          Shalin Shekhar Mangar added a comment - As noted on solr-dev, I had misunderstood the behavior. The problem was that I had configured a separate handler for spellcheck and I needed to pass shards.qt to make it work with distributed search. So distributed spell check is working and I'm trying to write tests making sure that the response returned by distributed spellcheck is the same as the one returned by non-distributed spellcheck for the various spellcheck parameters. In the current patch, distributed spellcheck always returns correctlySpelled=false which non-distributed mode does not. I'll have a patch ready soon with the test and fixes.
          Hide
          Shalin Shekhar Mangar added a comment -

          There are a couple of problems with the current patch:

          1. It breaks ties using frequency info even if onlyMorePopular=false
          2. If more than one suggestion are returned by a shard (extendedResults=false), it uses the first one and discards the rest
          3. It duplicates a lot of response writing code

          Ideally, we'd like to:

          1. Share the same response writing code - Construct a SpellingResult using the shard responses and pass it to the existing toNamedList method so that there is no discrepancy between distributed and non-distributed mode
          2. Break ties by the configured distance measure. If distance is same and onlyMorePopular=true, then break ties by frequency (same as how Lucene SpellChecker breaks ties)
          Show
          Shalin Shekhar Mangar added a comment - There are a couple of problems with the current patch: It breaks ties using frequency info even if onlyMorePopular=false If more than one suggestion are returned by a shard (extendedResults=false), it uses the first one and discards the rest It duplicates a lot of response writing code Ideally, we'd like to: Share the same response writing code - Construct a SpellingResult using the shard responses and pass it to the existing toNamedList method so that there is no discrepancy between distributed and non-distributed mode Break ties by the configured distance measure. If distance is same and onlyMorePopular=true, then break ties by frequency (same as how Lucene SpellChecker breaks ties)
          Hide
          Shalin Shekhar Mangar added a comment -
          1. Break ties exactly as Lucene SpellChecker
          2. Merge frequency info of original tokens as well as suggestions
          3. Create SpellingResult objects after merging data from shards and pass to #toNamedList to write response

          This is a half baked patch and not fit to be committed. Problems include (but not limited to!) :

          1. The test case fails for extendedResults=false (something to do how original frequency is added to SpellingResult) as it returns correctlySpelled=false when it shouldn't.
          2. Spellcheck Index corruption in stress tests - race conditions in SpellCheckComponent I guess
          3. Not optimized at all - quick hack

          I'm putting this out because I'll not be able to look at this for the next two days. If someone wants to take this out for a spin, please do.

          Show
          Shalin Shekhar Mangar added a comment - Break ties exactly as Lucene SpellChecker Merge frequency info of original tokens as well as suggestions Create SpellingResult objects after merging data from shards and pass to #toNamedList to write response This is a half baked patch and not fit to be committed. Problems include (but not limited to!) : The test case fails for extendedResults=false (something to do how original frequency is added to SpellingResult) as it returns correctlySpelled=false when it shouldn't. Spellcheck Index corruption in stress tests - race conditions in SpellCheckComponent I guess Not optimized at all - quick hack I'm putting this out because I'll not be able to look at this for the next two days. If someone wants to take this out for a spin, please do.
          Hide
          Shalin Shekhar Mangar added a comment -
          1. Fixes bugs with response format
          2. Fixes another bug with wrong order of results

          We need to take another look at how the rebuild happens. There seem to be some race conditions which show up when we do stress testing. For now, this patch disables stress tests. I'll investigate and open other issues for that.

          Show
          Shalin Shekhar Mangar added a comment - Fixes bugs with response format Fixes another bug with wrong order of results We need to take another look at how the rebuild happens. There seem to be some race conditions which show up when we do stress testing. For now, this patch disables stress tests. I'll investigate and open other issues for that.
          Hide
          Shalin Shekhar Mangar added a comment -

          We need to take another look at how the rebuild happens. There seem to be some race conditions which show up when we do stress testing. For now, this patch disables stress tests. I'll investigate and open other issues for that.

          I just noticed the LUCENE-2108 commit. After upgrading the contrib spellcheck jar to lucene trunk, these issues are no longer reproducible. Those changes are not yet ported to the Lucene 2.9 branch but will soon be.

          Do we need to keep Lucene core and contrib versions in sync? It seems like we'll need to upgrade contrib-spellcheck. Is it necessary to upgrade both core and contrib to latest 2.9 branch or can we just go ahead and upgrade spellcheck (perhaps even to 3.0 since it is compatible).

          Show
          Shalin Shekhar Mangar added a comment - We need to take another look at how the rebuild happens. There seem to be some race conditions which show up when we do stress testing. For now, this patch disables stress tests. I'll investigate and open other issues for that. I just noticed the LUCENE-2108 commit. After upgrading the contrib spellcheck jar to lucene trunk, these issues are no longer reproducible. Those changes are not yet ported to the Lucene 2.9 branch but will soon be. Do we need to keep Lucene core and contrib versions in sync? It seems like we'll need to upgrade contrib-spellcheck. Is it necessary to upgrade both core and contrib to latest 2.9 branch or can we just go ahead and upgrade spellcheck (perhaps even to 3.0 since it is compatible).
          Hide
          Simon Willnauer added a comment -

          FYI

          Those changes are not yet ported to the Lucene 2.9 branch but will soon be.

          Done!

          Show
          Simon Willnauer added a comment - FYI Those changes are not yet ported to the Lucene 2.9 branch but will soon be. Done!
          Hide
          Shalin Shekhar Mangar added a comment -

          Thanks Simon!

          1. Overrides modifyRequest to enable spellcheck only in GET_TOP_IDS (otherwise we do the spellcheck for each request)
          2. Request for at least 5 suggestions from each shard for more accurate merged results
          3. Re-enables stress testing (requires LUCENE-2108)
          4. Use LinkedHashMap for "origVsSuggested" map to preserve order of tokens because collation algorithm only works in-order.

          All tests pass.

          This can definitely be optimized further. The four maps can be removed if we can collect by iterating over the tokens in original query (we may need one to cache SpellCheckResponse). However, I have bigger itches to scratch so I'll leave this here.

          I'll commit this after a day or two in case somebody wants to review the patch (or improve it).

          Show
          Shalin Shekhar Mangar added a comment - Thanks Simon! Overrides modifyRequest to enable spellcheck only in GET_TOP_IDS (otherwise we do the spellcheck for each request) Request for at least 5 suggestions from each shard for more accurate merged results Re-enables stress testing (requires LUCENE-2108 ) Use LinkedHashMap for "origVsSuggested" map to preserve order of tokens because collation algorithm only works in-order. All tests pass. This can definitely be optimized further. The four maps can be removed if we can collect by iterating over the tokens in original query (we may need one to cache SpellCheckResponse). However, I have bigger itches to scratch so I'll leave this here. I'll commit this after a day or two in case somebody wants to review the patch (or improve it).
          Hide
          Shalin Shekhar Mangar added a comment -

          Updating for SOLR-1608 commit.

          Show
          Shalin Shekhar Mangar added a comment - Updating for SOLR-1608 commit.
          Hide
          Shalin Shekhar Mangar added a comment -

          Committed revision 888796.

          Thanks Matthew!

          Show
          Shalin Shekhar Mangar added a comment - Committed revision 888796. Thanks Matthew!
          Hide
          Hoss Man added a comment -

          Correcting Fix Version based on CHANGES.txt, see this thread for more details...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Show
          Hoss Man added a comment - Correcting Fix Version based on CHANGES.txt, see this thread for more details... http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E
          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1.0 release

            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Shalin Shekhar Mangar
            • Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development