FSTLookup should use long-tail like discretization instead of proportional (linear)

Details

• Type: Improvement
• Status: Closed
• Priority: Minor
• Resolution: Duplicate
• Affects Version/s: 3.4
• Fix Version/s:
• Component/s:
• Labels: None

Description

The Suggester's FSTLookup implementation discretizes the term frequencies into a configurable number of buckets (configurable as "weightBuckets") in order to deal with FST limitations. The mapping of a source frequency into a bucket is a proportional (i.e. linear) mapping between the minimum and maximum value. I don't think this makes sense given the well-known long-tail-like distribution of term frequencies. As a result of this problem, I've found it necessary to increase weightBuckets substantially, to >100, to get quality suggestions.
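To illustrate the concern, here is a hedged sketch (my own code, not the actual FSTLookup implementation; all names are illustrative): with a proportional mapping over [min, max], nearly the entire long tail of low-frequency terms collapses into the bottom bucket, whereas a log-scale mapping spreads those terms across buckets.

```java
// Sketch only: compares proportional (linear) bucketing, as described in the
// issue, against a log-scale alternative that suits long-tail distributions.
public class BucketSketch {
    // Proportional mapping of a weight into [0, buckets-1]:
    // bucket = (w - min) / (max - min) scaled to the bucket count.
    public static int linearBucket(long w, long min, long max, int buckets) {
        if (max == min) return 0;
        return (int) ((w - min) * (buckets - 1) / (max - min));
    }

    // Long-tail-friendly alternative: bucket by the log of the weight, so the
    // many low-frequency terms do not all collapse into bucket 0.
    public static int logBucket(long w, long min, long max, int buckets) {
        double lo = Math.log1p(min), hi = Math.log1p(max);
        if (hi == lo) return 0;
        return (int) ((Math.log1p(w) - lo) / (hi - lo) * (buckets - 1));
    }
}
```

With min=1, max=1,000,000 and 10 buckets, a term with frequency 100 lands in bucket 0 under the linear scheme but in bucket 2 under the log scheme, which is the quality gap the description complains about.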

Activity

Michael McCandless added a comment -

What limitation of FSTs is causing us to discretize the term frequencies?

David Smiley added a comment -

It should be noted there are code comments Dawid left on doing another approach:

```java
// Distribute weights into at most N buckets. This is a form of discretization to
// limit the number of possible weights so that they can be efficiently encoded in the
// automaton.
//
// It is assumed the distribution of weights is _linear_ so proportional division
// of [min, max] range will be enough here. Other approaches could be to sort
// weights and divide into proportional ranges.
```
Dawid Weiss added a comment -

Let me know if anything is not clear, Mike.

Michael McCandless added a comment -

Ooooh, the javadocs and comments are awesome! – thanks Dawid and David.

I was just wondering what specifically is the limitation on our FST
impl and whether it's something we could improve. It sounds like the
limitation is just how we quantize the incoming weights...

David, when you use > 100 buckets did you see bad performance for
low-weight lookups?

Maybe, in addition to the up-front quantization, we could also store
a more exact weight for each term (eg as the output). Then on
retrieve we could re-sort the candidates by that exact weight. But
this will make the FST larger...

David Smiley added a comment -

> David, when you use > 100 buckets did you see bad performance for low-weight lookups?

I didn't try in any serious way. I was simply writing about this feature when I observed the suggestions were poor compared to other Lookup impls and other ways of doing term completion. Then I started digging into why and what could be done about it.

Dawid Weiss added a comment -

I guess a lot depends on the use case. In my case quantization was not a problem (the scores were "rough" and query independent anyway, so they did fall into corresponding buckets). "poor" performance would then have to be backed by what the requirement really is – if one needs sorting by exact scores then the method used to speed up FSTLookup simply isn't a good fit. Still, compared to fetching everything and resorting this is a hell of a lot faster, so many folks (including me) may find it helpful.

It all depends, in other words.

As for using more buckets – sure, you can do this. In fact, you can combine both approaches and use quantization to prefetch a buffer of matches, then collect outputs, sort and if this fills your desired number of results then there is no need to search any further because all other buckets will have lower scores (exact).
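The combined strategy described above can be sketched as follows (hypothetical code, not the Lucene implementation; the per-bucket FST traversal is stubbed with a map keyed by bucket): walk buckets from highest to lowest, collect and sort exact scores within each bucket, and stop as soon as the requested number of results is filled, since every remaining bucket can only hold lower scores.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;

// Sketch of quantized prefetch with exact-score re-sorting and early exit.
public class BucketedTopN {
    public static List<Long> topN(NavigableMap<Integer, List<Long>> byBucket, int n) {
        List<Long> results = new ArrayList<>();
        // descendingMap() visits the highest-numbered (highest-score) buckets first.
        for (List<Long> exactScores : byBucket.descendingMap().values()) {
            List<Long> sorted = new ArrayList<>(exactScores);
            sorted.sort(Comparator.reverseOrder()); // exact order within the bucket
            results.addAll(sorted);
            if (results.size() >= n) break; // lower buckets cannot beat these scores
        }
        return results.subList(0, Math.min(n, results.size()));
    }
}
```

The early exit is what makes this "a hell of a lot faster" than fetching everything and re-sorting: in the common case only the top bucket or two is ever visited.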

David Smiley added a comment -

I recommend that we follow through with the alternative suggested in the source code comment: sort by weight and divide evenly. That will handle the actual distribution in the data no matter what the curve looks like.
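What this recommends is essentially quantile bucketing. A minimal sketch (assumed names, not Lucene API): sort the weights once, cut the sorted array into equal-count ranges, and record the cut points as bucket thresholds.

```java
import java.util.Arrays;

// Sketch of "sort by weight and divide evenly": bucket boundaries follow the
// empirical distribution of the data, whatever the curve looks like.
public class QuantileBuckets {
    // thresholds[i] is the smallest weight that falls into bucket i+1.
    public static long[] thresholds(long[] weights, int buckets) {
        long[] sorted = weights.clone();
        Arrays.sort(sorted);
        long[] t = new long[buckets - 1];
        for (int i = 1; i < buckets; i++) {
            t[i - 1] = sorted[i * sorted.length / buckets];
        }
        return t;
    }

    public static int bucketOf(long w, long[] thresholds) {
        int b = 0;
        while (b < thresholds.length && w >= thresholds[b]) b++;
        return b;
    }
}
```

For a long-tailed input like {1, 1, 1, 1, 2, 2, 3, 10, 100, 1000} with two buckets, this splits the data 50/50 at weight 2, whereas a linear [min, max] split would put nine of the ten entries in the bottom bucket.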

Dawid Weiss added a comment -

I'll work on it. You may be the first serious user of that code. Most people simply use Solr queries to do suggestions, it seems (because it provides more flexibility in terms of dynamic scoring).

David Smiley added a comment -

LOL; I am not using it "seriously". I'm merely kicking the tires to see how well it works so I can write about it in the 2nd edition of my book. When you say "Most people use Solr queries to do suggestions it seems", do you mean search query logs? That requires sufficient query volume, and it's more work to set up, for sure, than query term completion/suggest.

Dawid Weiss added a comment -

Don't laugh, I'm serious. I once wanted to collect some feedback about who's using Lookups and how they're using it, and it seems people just load queries into Solr and do regular prefix searches over an index (possibly mutating the scoring with a function of geoip proximity or other factors).

Dawid Weiss added a comment -

Refactoring of FSTLookup API.

Dawid Weiss added a comment -

Brainstorming discussions with Robert and Simon who had real use cases. The outcome is that discretization into buckets will be problematic to "get right" in the general case. The distribution of weight functions may require custom tweaks and tunings that should best be done before weights are added to the FSTLookup. An explicit API of the form add(term, int bucket) will be added, with an adapter over TermFreqIterator to do min/max (value range) or long-tail (sorted input) bucketing. These adapters will be more costly as they may require additional passes over the data or re-sorting of the input data. The add(term, int bucket) will be cheap(er) with only a single sort required.
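Under those assumptions, the proposed API shape might look like this (a sketch with hypothetical names, not the final Lucene signatures): a cheap, explicit add(term, bucket) entry point, plus adapter functions that derive a bucket from a raw weight.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongToIntFunction;

// Hypothetical sketch of the proposed builder API, not actual Lucene code.
public class FSTLookupBuilderSketch {
    private final List<String> entries = new ArrayList<>();

    // Explicit, cheap entry point: the caller has already chosen the bucket.
    public void add(String term, int bucket) {
        entries.add(term + ":" + bucket);
    }

    // Adapter for the min/max (value-range) case; the long-tail adapter
    // would instead assign buckets by rank over sorted input, which is why
    // it needs an extra pass or re-sort of the data.
    public static LongToIntFunction rangeBucketer(long min, long max, int buckets) {
        return w -> max == min ? 0 : (int) ((w - min) * (buckets - 1) / (max - min));
    }

    public List<String> entries() { return entries; }
}
```

The design point is that bucketing policy lives outside the lookup: the FST only ever sees final bucket numbers, and any distribution-specific tuning happens in the adapter.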

Dawid Weiss added a comment -

Incorporated into SOLR-2888

Uwe Schindler added a comment -

Bulk close after 3.5 is released

Dawid Weiss added a comment -

Linking arbitrary score fst-based autocompletion algorithm.


People

• Assignee: Dawid Weiss
• Reporter: David Smiley