Solr / SOLR-4280

spellcheck.maxResultsForSuggest based on filter query results

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5
    • Component/s: spellchecker
    • Labels:
      None

      Description

      spellcheck.maxResultsForSuggest takes a fixed number but ideally should be able to take a ratio and calculate that against the maximum number of results the filter queries return.

      At least in our case this would certainly add a lot of value. More than 99% of our end users search within one or more filters, one of which is always unique. The number of documents behind each of those unique filters varies widely, from 300 to 3,000,000. maxResultsForSuggest is set to a reasonably low value, so it mostly works, but it sometimes produces undesired suggestions for a large subcorpus that contains more misspellings.

      Spun off from SOLR-4278.

      1. SOLR-4280.patch
        14 kB
        James Dyer
      2. SOLR-4280.patch
        13 kB
        Markus Jelsma
      3. SOLR-4280.patch
        11 kB
        James Dyer
      4. SOLR-4280-trunk.patch
        5 kB
        Markus Jelsma
      5. SOLR-4280-trunk.patch
        5 kB
        Markus Jelsma
      6. SOLR-4280-trunk-1.patch
        5 kB
        Markus Jelsma

          Activity

          Markus Jelsma added a comment -

          Patch for trunk introducing a spellcheck.percentageResultsForSuggest parameter. It uses the filterCache to look up the maximum number of possible results, so whether a term is treated as misspelled depends on the size of that maximum result set and on the value of this parameter.

          Since the filterCache cannot be retrieved from SolrIndexSearcher.getCache() at this moment you'll have to hack into it and have it add the filterCache to the cacheMap somewhere in the constructor.

          cacheMap.put(filterCache.name(), filterCache);
          
          James Dyer added a comment -

          Markus,

          On SOLR-4278, you said,

          "It would be helpful if we can take the number of hits for a single filter from the filterCache itself."

          I think the key here is "single filter". I think it would be possible for a user to misuse this feature and have it constantly adding entries to the filterCache to check if different scenarios have a high enough percentage to get a suggestion. Can we lock it down so users won't have to deal with unexpected bad performance in this case?

          Markus Jelsma added a comment -

          Hi James,

          I'm not sure that I follow. This patch only obtains the number of results for a given filter. If the user has a single filter in the query, e.g. lang:nl, this patch looks up the number of results for that filter only. The same is true for multiple filters, e.g. fq=lang:en&fq=host:apache.org: it just iterates over these filters in the cache and gets the number of documents they can return.

          How would a user misuse this feature? This patch does not write to the filterCache, users do so by adding fq-parameters.

          Thanks

          James Dyer added a comment -

          I think I misunderstood initially. Thanks for the clarification.

          Markus Jelsma added a comment -

          I forgot I had a working patch lying around. Specify spellcheck.percentageResultsForSuggest=0.25 to force maxResultsForSuggest to be 25% of the smallest filterQuery DocSet. This allows maxResultsForSuggest to be adjusted dynamically based on the filters specified.

          It doesn't seem to work in a distributed environment although the parameters are passed nicely. I haven't figured that out yet, but all shards return the same collation for undistributed requests. Tips?
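
To make the calculation in this comment concrete, here is a minimal sketch. It is not the code from the patch: the class name, the method names, and the map of filter sizes are hypothetical stand-ins for the DocSet sizes the patch reads from the filterCache, and rounding to the nearest integer is an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of Solr or of the attached patches.
final class PercentageThresholdSketch {

    /** Returns the size of the most selective filter, or -1 if no filter sizes are known. */
    static int smallestFilterSize(Map<String, Integer> filterSizes) {
        if (filterSizes.isEmpty()) {
            return -1;
        }
        return Collections.min(filterSizes.values());
    }

    /** Scales the smallest filter size by the fraction (rounding is an assumption). */
    static int thresholdFor(Map<String, Integer> filterSizes, double fraction) {
        int smallest = smallestFilterSize(filterSizes);
        if (smallest < 0) {
            return -1; // no filters: the caller would fall back to a fixed value
        }
        return (int) Math.round(smallest * fraction);
    }

    public static void main(String[] args) {
        // Filter sizes are made up, using the range mentioned in the description.
        Map<String, Integer> sizes = new LinkedHashMap<>();
        sizes.put("lang:en", 3_000_000);
        sizes.put("host:apache.org", 300);
        System.out.println(thresholdFor(sizes, 0.25)); // prints 75
    }
}
```

With these made-up sizes the smallest filter matches 300 documents, so a value of 0.25 yields a threshold of 75 regardless of how large the other filters are.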

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Markus Jelsma added a comment -

          New patch. This patch now also works in a distributed environment.

          Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          James Dyer added a comment -

          Here is an updated patch for Trunk. I've included unit tests and changed javadoc to reflect the added functionality. I've also modified how this gets triggered. Rather than introduce a new request parameter, the user passes in "spellcheck.maxResultsForSuggest" as a fraction between 0 and 1. So if the user wants no more than 5% of the most-selective filter's results to be the maximum results to trigger suggestions, they would specify "spellcheck.maxResultsForSuggest=.05". If, for instance, the most-selective filter returns (by itself) 100 documents, then the effective maximum number of hits a query can return and still trigger spelling suggestions is 5.

          Markus Jelsma does this all sound right to you? Is this still a feature you want and would be interested in seeing committed?
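
The arithmetic in this comment can be sketched as follows. This is an illustration only, not the committed code: the class and method names are hypothetical, and two details are assumptions that the comment does not spell out, namely that values of 1 or more are treated as an absolute document count and that the scaled value is rounded to the nearest integer.

```java
// Hypothetical illustration; not part of Solr or of the attached patches.
final class MaxResultsForSuggestSketch {

    /**
     * Resolves the effective threshold from the raw parameter value.
     *
     * @param paramValue              raw value of spellcheck.maxResultsForSuggest
     * @param mostSelectiveFilterHits hits the most selective filter returns by itself
     */
    static int effectiveMaxResults(double paramValue, int mostSelectiveFilterHits) {
        if (paramValue <= 0.0) {
            return 0; // assumption: non-positive values leave the threshold at zero
        }
        if (paramValue < 1.0) {
            // Fractional value: scale the filter's hit count.
            return (int) Math.round(paramValue * mostSelectiveFilterHits);
        }
        // Assumption: 1 or more is an absolute number of documents.
        return (int) paramValue;
    }

    public static void main(String[] args) {
        System.out.println(effectiveMaxResults(0.05, 100)); // prints 5
        System.out.println(effectiveMaxResults(10, 100));   // prints 10
    }
}
```

The first call reproduces the example from the comment: 5% of a 100-document filter gives an effective threshold of 5.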

          Markus Jelsma added a comment -

          Hello James Dyer - this sounds very good. However, an addition to this feature would be the option to also choose which filter the fraction operates on. I have seen some strange results when drilling deeper using more and more restrictive filters.

          It was originally meant for using maxResultsForSuggest in a multi-tenant index. A single fixed maxResultsForSuggest cannot suit both clients with a very large index and clients with a small one.

          How about an optional spellcheck.maxResultsForSuggest.fq=field:value. If the user specifies this, the patch won't need to find the most restrictive filter.

          M.

          James Dyer added a comment -

          Markus Jelsma Are you able to take the updated patch and add the additional functionality you suggest? I agree that the "most-restrictive" filter might not serve everyone's needs, but all-in-all this might be a nice feature for multi-tenant situations.

          Markus Jelsma added a comment -

          Updated patch. Added SPELLCHECK_MAX_RESULTS_FOR_SUGGEST_FQ = spellcheck.maxResultsForSuggest.fq to take a filter query. Only used if maxResultsForSuggest is a fraction.
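
A request that combines the two parameters might be assembled as in the sketch below. It uses only the JDK rather than a Solr client; the class and method names are hypothetical, and the query term and filter values are made up for illustration (the filter reuses host:apache.org from earlier in this thread).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration; not part of Solr or of the attached patches.
final class SpellcheckRequestSketch {

    /** URL-encodes each key and value and joins the pairs with '&'. */
    static String toQueryString(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            joiner.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "solr");                  // made-up query term
        params.put("fq", "host:apache.org");      // regular filter query
        params.put("spellcheck", "true");
        // Fractional value, so the threshold is relative to a filter's result count.
        params.put("spellcheck.maxResultsForSuggest", "0.05");
        // Names the filter the fraction applies to, as proposed in this issue.
        params.put("spellcheck.maxResultsForSuggest.fq", "host:apache.org");
        System.out.println(toQueryString(params));
    }
}
```

Per the comment above, the .fq parameter only matters when maxResultsForSuggest is a fraction; with a whole number it would be ignored.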

          James Dyer added a comment -

          Clean-up patch with slightly better testing, javadoc. Once I can run tests & precommit on it, I will commit this.

          ASF subversion and git services added a comment -

          Commit 1720636 from jdyer@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1720636 ]

          SOLR-4280: Allow specifying "spellcheck.maxResultsForSuggest" as a percentage of filter query results

          ASF subversion and git services added a comment -

          Commit 1720637 from jdyer@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1720637 ]

          SOLR-4280: Allow specifying "spellcheck.maxResultsForSuggest" as a percentage of filter query results

          Markus Jelsma added a comment -

          Great work James! Many thanks!

          James Dyer added a comment -

          And thanks to you, Markus, for actually developing the code for this.


            People

            • Assignee:
              James Dyer
              Reporter:
              Markus Jelsma
            • Votes:
              0
              Watchers:
              3
