SOLR-2761: FSTLookup should use long-tail-like discretization instead of proportional (linear)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 3.4
    • Fix Version/s: 3.5, 3.6, 4.0-ALPHA
    • Component/s: spellchecker
    • Labels: None

      Description

      The Suggester's FSTLookup implementation discretizes the term frequencies into a configurable number of buckets ("weightBuckets") in order to deal with FST limitations. The mapping of a source frequency into a bucket is proportional (i.e. linear) over the range from the minimum to the maximum value. I don't think this makes sense at all given the well-known long-tail-like distribution of term frequencies. As a result of this problem, I've found it necessary to increase weightBuckets substantially, to more than 100, to get quality suggestions.
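
      A rough illustration of why the linear mapping hurts, as a minimal Java sketch (hypothetical code, not the actual FSTLookup implementation), using a toy Zipf-like frequency list:

          public class ProportionalBucketing {
            /** Proportional (linear) mapping of a weight into one of N equal-width buckets over [min, max]. */
            static int bucket(float weight, float min, float max, int buckets) {
              if (max == min) return 0;
              return Math.min((int) ((weight - min) / (max - min) * buckets), buckets - 1);
            }

            public static void main(String[] args) {
              // Long-tail term frequencies: a few huge values, many tiny ones.
              float[] freqs = {100000, 50000, 1000, 120, 40, 12, 5, 3, 2, 1};
              for (float f : freqs) {
                // With 10 buckets, every term below ~10000 collapses into bucket 0,
                // so the tail's relative ordering is lost after discretization.
                System.out.printf("freq=%8.0f -> bucket %d%n", f, bucket(f, 1, 100000, 10));
              }
            }
          }

      With weightBuckets = 10, eight of the ten terms above land in bucket 0, which is consistent with needing far more buckets before low-frequency terms separate.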


          Activity

          Dawid Weiss added a comment -

          Linking arbitrary score fst-based autocompletion algorithm.

          Uwe Schindler added a comment -

          Bulk close after 3.5 is released

          Dawid Weiss added a comment -

          Incorporated into SOLR-2888

          Dawid Weiss added a comment -

          Brainstorming discussions with Robert and Simon, who had real use cases. The outcome is that discretization into buckets will be problematic to "get right" in the general case. The distribution of weight functions may require custom tweaks and tunings that are best done before weights are added to the FSTLookup. An explicit API of the form add(term, int bucket) will be added, with an adapter over TermFreqIterator to do min/max (value range) or long-tail (sorted input) bucketing. These adapters will be more costly, as they may require additional passes over the data or re-sorting of the input. The add(term, int bucket) path will be cheap(er), with only a single sort required.
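
          A hypothetical sketch of that API shape (names and signatures are illustrative, not the committed Lucene code), showing the cheap explicit path and the two costlier adapters:

              import java.util.*;

              interface BucketedBuilder {
                /** Cheap path: the caller has already discretized the weight; only one sort is needed. */
                void add(CharSequence term, int bucket);
              }

              class Adapters {
                /** Min/max (value range) adapter: needs an extra pass to find the range. */
                static void addLinear(BucketedBuilder b, Map<String, Long> freqs, int buckets) {
                  long min = Collections.min(freqs.values()), max = Collections.max(freqs.values());
                  for (Map.Entry<String, Long> e : freqs.entrySet()) {
                    int bucket = (max == min) ? 0
                        : (int) ((e.getValue() - min) * (buckets - 1) / (max - min));
                    b.add(e.getKey(), bucket);
                  }
                }

                /** Long-tail adapter: re-sorts by weight, then splits into equal-count buckets. */
                static void addLongTail(BucketedBuilder b, Map<String, Long> freqs, int buckets) {
                  List<Map.Entry<String, Long>> sorted = new ArrayList<>(freqs.entrySet());
                  sorted.sort(Map.Entry.comparingByValue()); // the extra cost: input must be re-sorted
                  for (int i = 0; i < sorted.size(); i++) {
                    b.add(sorted.get(i).getKey(), i * buckets / sorted.size());
                  }
                }
              }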

          Dawid Weiss added a comment -

          Refactoring of FSTLookup API.

          Dawid Weiss added a comment -

          Don't laugh, I'm serious. I once wanted to collect some feedback about who's using Lookups and how they're using them, and it seems people just load queries into Solr and do regular prefix searches over an index (possibly mutating the scoring with a function of geoip proximity or other factors).

          David Smiley added a comment -

          LOL; I am not using it "seriously". I'm merely kicking the tires to see how well it works so I can write about it in the 2nd edition of my book. When you say "Most people use Solr queries to do suggestions it seems", do you mean search query logs? That requires sufficient query volume, and it's more work to set up, for sure, than query term completion/suggest.

          Dawid Weiss added a comment -

          I'll work on it. You may be the first serious user of that code. Most people simply use Solr queries to do suggestions, it seems (because it provides more flexibility in terms of dynamic scoring).

          David Smiley added a comment -

          I recommend that we follow through with the alternative suggested in the source code comment: sort by weight and divide evenly. That will handle the actual distribution in the data no matter what the curve looks like.

          Dawid Weiss added a comment -

          I guess a lot depends on the use case. In my case quantization was not a problem (the scores were "rough" and query-independent anyway, so they did fall into corresponding buckets). "Poor" performance would then have to be weighed against what the requirement really is: if one needs sorting by exact scores then the method used to speed up FSTLookup simply isn't a good fit. Still, compared to fetching everything and re-sorting, this is a hell of a lot faster, so many folks (including me) may find it helpful.

          It all depends, in other words.

          As for using more buckets: sure, you can do this. In fact, you can combine both approaches: use quantization to prefetch a buffer of matches, then collect the outputs and sort them; if this fills your desired number of results, there is no need to search any further, because all other buckets will have lower (exact) scores.
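
          A minimal sketch of that hybrid (hypothetical code; fetchBucket is an illustrative stand-in for a per-bucket FST prefix traversal, not a real Lucene method):

              import java.util.*;
              import java.util.function.BiFunction;

              class HybridSuggest {
                record Suggestion(String term, long exactScore) {}

                /** Walk buckets from highest to lowest; stop once enough results are collected. */
                static List<Suggestion> suggest(String prefix, int numBuckets, int wanted,
                                                BiFunction<String, Integer, List<Suggestion>> fetchBucket) {
                  List<Suggestion> results = new ArrayList<>();
                  for (int b = numBuckets - 1; b >= 0 && results.size() < wanted; b--) {
                    List<Suggestion> batch = new ArrayList<>(fetchBucket.apply(prefix, b));
                    // Exact re-sorting is only needed within a bucket; buckets themselves are
                    // already ordered, so any remaining bucket cannot outrank these hits.
                    batch.sort(Comparator.comparingLong(Suggestion::exactScore).reversed());
                    results.addAll(batch);
                  }
                  return results.subList(0, Math.min(wanted, results.size()));
                }
              }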

          David Smiley added a comment -

          "David, when you use > 100 buckets did you see bad performance for low-weight lookups?"

          I didn't try in any serious way. I was simply writing about this feature when I observed the suggestions were poor compared to other Lookup impls and other ways of doing term completion. Then I started digging into why and what could be done about it.

          Michael McCandless added a comment -

          Ooooh, the javadocs and comments are awesome! Thanks Dawid and David.

          I was just wondering what specifically is the limitation on our FST impl and whether it's something we could improve. It sounds like the limitation is just how we quantize the incoming weights...

          David, when you use > 100 buckets did you see bad performance for low-weight lookups?

          Maybe, in addition to the up-front quantization, we could also store a more exact weight for each term (e.g. as the output). Then on retrieval we could re-sort the candidates by that exact weight. But this will make the FST larger...
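
          One way to picture that trade-off (an illustration only, not Lucene's actual FST encoding): pack the coarse bucket and a more exact weight into a single long output, so bucket order still drives traversal while the low bits allow a precise re-sort of candidates, at the cost of larger, less shareable outputs:

              class PackedWeight {
                /** High 16 bits: bucket (drives traversal order); low 48 bits: exact weight (for re-sorting). */
                static long pack(int bucket, long exactWeight) {
                  return ((long) bucket << 48) | (exactWeight & 0xFFFFFFFFFFFFL);
                }

                static int bucketOf(long packed) { return (int) (packed >>> 48); }

                static long exactWeightOf(long packed) { return packed & 0xFFFFFFFFFFFFL; }
              }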

          Dawid Weiss added a comment -

          Let me know if anything is not clear, Mike.

          David Smiley added a comment -

          It should be noted that Dawid left code comments about an alternative approach:

              // Distribute weights into at most N buckets. This is a form of discretization to
              // limit the number of possible weights so that they can be efficiently encoded in the
              // automaton.
              //
              // It is assumed the distribution of weights is _linear_ so proportional division 
              // of [min, max] range will be enough here. Other approaches could be to sort 
              // weights and divide into proportional ranges.
          
          David Smiley added a comment -

          FSTLookup is well documented, thanks to Dawid. Here is a link to the Javadocs for your convenience, Mike: https://builds.apache.org/job/Lucene-3.x/javadoc/all/org/apache/lucene/search/suggest/fst/FSTLookup.html?is-external=true

          Michael McCandless added a comment -

          What limitation of FSTs is causing us to discretize the term frequencies?


            People

            • Assignee: Dawid Weiss
            • Reporter: David Smiley
            • Votes: 0
            • Watchers: 0
