Solr

add spellcheck 'approximate collation count' mode

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, 6.0
    • Component/s: spellchecker
    • Labels: None

      Description

      SpellCheck's Collation in Solr is a way to ensure spellcheck/suggestions
      will actually net results (taking into account context like filtering).

      In order to do this (from my understanding), it generates candidate queries,
      executes them, and saves the total hit count: collation.setHits(hits).

      For a large index it seems this might be doing too much work: in particular
      I'm interested in ensuring this feature can work fast enough/well for autosuggesters.

      So I think we should offer an 'approximate' mode that uses an early-terminating
      Collector, collect()ing only N docs (e.g. n=1), and approximates the result
      count based on docid space.

      I'm not sure what needs to happen on the solr side (possibly support for custom collectors?),
      but I think this could help and should possibly be the default.
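
      To make the idea concrete, here is a minimal sketch of such a collector (written
      against a recent Lucene SimpleCollector API; the class, method, and exception names
      below are illustrative and are not the implementation that was eventually committed):

      import org.apache.lucene.index.LeafReaderContext;
      import org.apache.lucene.search.ScoreMode;
      import org.apache.lucene.search.SimpleCollector;

      /** Sketch only: collects at most maxDocsToCollect hits, then estimates the total. */
      public class ApproximateHitCountCollector extends SimpleCollector {

        private final int maxDocsToCollect; // the "n" discussed above
        private final int maxDoc;           // total number of documents in the index
        private int numCollected = 0;
        private int docBase = 0;
        private int lastDocCollected = -1;  // global docid of the most recent hit

        public ApproximateHitCountCollector(int maxDocsToCollect, int maxDoc) {
          this.maxDocsToCollect = maxDocsToCollect;
          this.maxDoc = maxDoc;
        }

        @Override
        protected void doSetNextReader(LeafReaderContext context) {
          docBase = context.docBase;
        }

        @Override
        public void collect(int doc) {
          numCollected++;
          lastDocCollected = docBase + doc;
          if (numCollected >= maxDocsToCollect) {
            // A real implementation would throw an exception that the caller catches
            // in order to abort the whole search (Solr's patch defines its own type).
            throw new EarlyTerminationException();
          }
        }

        @Override
        public ScoreMode scoreMode() {
          return ScoreMode.COMPLETE_NO_SCORES; // we only count hits, no scoring needed
        }

        /** Hits seen so far, extrapolated over the fraction of the docid space visited. */
        public long estimatedHitCount() {
          if (numCollected < maxDocsToCollect) {
            return numCollected; // collection finished naturally, so the count is exact
          }
          return Math.round((double) numCollected * maxDoc / (lastDocCollected + 1));
        }

        /** Illustrative unchecked exception used to stop the search early. */
        public static class EarlyTerminationException extends RuntimeException {
        }
      }

      Increasing maxDocsToCollect trades speed for accuracy; set high enough to collect
      every hit, the count is exact.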

      Attachments

      1. SOLR-3240.patch
        27 kB
        James Dyer
      2. SOLR-3240.patch
        27 kB
        James Dyer
      3. SOLR-3240.patch
        27 kB
        James Dyer

          Activity

          James Dyer added a comment -

          Are you saying that if a user only cares that a collation will yield some hits, but doesn't care how many, then we can short-circuit these queries to quit once one document is collected? (Alternatively, quit after n docs are collected if the user doesn't care whether it is "greater than n"?)

          Robert Muir added a comment -

          Yes, but I'm saying that we can also still approximate the hit count.

          For example, for n=1, if you have 20,000 docs and the first docid is '100', we estimate there are 200 matching docs.
          You can increase n (the max number of collected docs) to increase the accuracy at the cost of performance.
          Currently n=infinity and it's always exact.

          James, can you tell me how collation.hits is used? Does collation use this directly as a heuristic for re-ranking suggestions?
          Or is it only metadata supplied to the user?

          The idea here is that exact numbers are probably not needed for most use cases: users would probably rather have
          inexact hit counts and faster performance.
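
          As a quick sanity check of the arithmetic above (a back-of-the-envelope extrapolation; the exact formula in the final patch may differ slightly):

          public class CollationEstimateExample {
            public static void main(String[] args) {
              int n = 1;             // docs collected before terminating early
              int firstDocId = 100;  // docid of the single collected hit
              int maxDoc = 20_000;   // total documents in the index
              // one hit within the first ~100 docids, scaled to the whole docid space
              long estimate = Math.round((double) n * maxDoc / firstDocId);
              System.out.println(estimate); // 200
            }
          }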

          James Dyer added a comment -

          collation.hits is just metadata for the user, so I think what you want to do would be entirely valid.

          The estimates would only be good if the hits are somewhat evenly distributed across the index, right? For instance, if you're indexing something by topic and a bunch of new docs get added on the same topic around the same time, you'd get a cluster of hits in one place.

          Even so, like you say, many (most) people would rather improve performance than have an accurate hit count (or any hit count) returned.

          Beyond this, there are also some dead-simple optimizations we can make by simply removing any sorting & boosting parameters from the query before testing the collation.

          Robert Muir added a comment -

          Beyond this, there are also some dead-simple optimizations we can make by simply removing any sorting & boosting parameters from the query before testing the collation.

          Right, though as a custom collector we effectively get this too: we wouldn't be sorting or scoring anything.

          James Dyer added a comment -

          Here's a patch for this one.

          Robert, is this something like what you had in mind when you opened this issue?

          Robert Muir added a comment -

          I think so!

          We could optimize it further by ensuring that we aren't collecting scores or anything (e.g. I think we should be wrapping something like a TotalHitCountCollector (http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollector.java?view=markup), or just not wrapping any collector at all?).

          But this patch is probably a good improvement for the worst case.

          James Dyer added a comment -

          Ok. I think I have a version here that will never compute scores, without having to write a lot of special code for it.

          Best I can tell, when "collateMaxCollectDocs" is 0 or not specified, it will use the first inner-class Collector in SolrIndexSearcher#getDocListNC (this one is almost identical to TotalHitCountCollector). Otherwise, it will use OneComparatorNonScoringCollector with the sort being on "<id>". These queries will also make use of the Solr filter cache & query result cache when they can, etc.

          One thing: the unit tests make it easy to determine whether it is giving the estimate you'd expect. What I can't so easily test is whether, when hit reporting is turned off entirely (collateExtendedResults=false), it is still picking a non-scoring collector. I would like to add a test for this but I'm not sure what the least-invasive approach would be.

          I'm also thinking I can safely get rid of the "forceInorderCollection" flag because requesting docs sorted by doc-id would enforce the same thing, right?
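
          For reference, a request using the new option might look something like this SolrJ sketch (the query text and parameter values are illustrative; spellcheck.collateMaxCollectDocs is the option discussed above, the rest are standard spellcheck parameters):

          import org.apache.solr.client.solrj.SolrQuery;

          public class CollateMaxCollectDocsExample {
            public static SolrQuery buildSpellcheckQuery() {
              SolrQuery q = new SolrQuery("delll ultrashar");    // misspelled user query
              q.set("spellcheck", true);
              q.set("spellcheck.collate", true);                 // build collations from suggestions
              q.set("spellcheck.collateExtendedResults", true);  // report per-collation hit counts
              q.set("spellcheck.maxCollationTries", 5);          // candidate collations to test
              q.set("spellcheck.collateMaxCollectDocs", 100);    // collect at most ~100 docs per try
                                                                 // and estimate the hit count
                                                                 // (0 = collect everything, exact count)
              return q;
            }
          }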

          Nalini Kartha added a comment -

          Any timeline on when this would go in? It'd be useful for an extra option we're trying to add to the DirectSpellChecker - we want to issue queries for each suggestion to check that they would return some hits taking into account the fq params of the main query. Since we only care about the suggestion returning at least 1 hit, looks like this Collector would improve performance a lot.

          James Dyer added a comment -

          Here is an updated patch for Trunk which I plan to commit soon.

          Note that this patch's EarlyTerminatingCollector is similar to the recently-added (LUCENE-4858) EarlyTerminatingSortingCollector. However, there seem to be enough differences that I did not attempt to combine the two. I have EarlyTerminatingCollector in a ".solr" package, but possibly this should be in a ".lucene" package instead?

          Any review or comments are appreciated.

          Commit Tag Bot added a comment -

          [trunk commit] jdyer
          http://svn.apache.org/viewvc?view=revision&revision=1479638

          SOLR-3240: add "spellcheck.collateMaxCollectDocs" for estimating collation hit-counts.

          Commit Tag Bot added a comment -

          [branch_4x commit] jdyer
          http://svn.apache.org/viewvc?view=revision&revision=1479644

          SOLR-3240: add "spellcheck.collateMaxCollectDocs" for estimating collation hit-counts.

          Robert Muir added a comment -

          Thanks for taking care of this, James: nice work.

          Commit Tag Bot added a comment -

          [trunk commit] jdyer
          http://svn.apache.org/viewvc?view=revision&revision=1479645

          SOLR-3240: add "spellcheck.collateMaxCollectDocs" (removing dead code).

          Commit Tag Bot added a comment -

          [branch_4x commit] jdyer
          http://svn.apache.org/viewvc?view=revision&revision=1479647

          SOLR-3240: add "spellcheck.collateMaxCollectDocs" (removing dead code).

          Steve Rowe added a comment -

          Bulk close resolved 4.4 issues


            People

            • Assignee: James Dyer
            • Reporter: Robert Muir
            • Votes: 1
            • Watchers: 3
