Lucene - Core
LUCENE-3714

add suggester that uses shortest path/wFST instead of buckets

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/spellchecker
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently the FST suggester (really an FSA) quantizes weights into buckets (e.g. single byte) and puts them in front of the word.
      This makes it fast, but you lose granularity in your suggestions.

      Lately the question was raised: if you build Lucene's FST with PositiveIntOutputs, does it behave the same as a tropical-semiring wFST?

      In other words, after completing the word, we instead traverse min(output) at each node to find the 'shortest path' to the
      best suggestion (with the highest score).

      This means we wouldn't need to quantize weights at all, and it might make some operations (e.g. adding fuzzy matching) a lot easier.
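The idea can be sketched with toy code (a plain trie; this is my own illustration, not Lucene's FST/PositiveIntOutputs API, and all names are made up): if every node records the minimum weight of any completion at or below it, which is effectively what tropical (min/+) wFST outputs give you, then the best suggestion is a greedy walk along the minimum-weight arc, the "shortest path".

```java
import java.util.*;

// Toy sketch, not Lucene code: each trie node records the minimum weight of
// any completion at or below it; top-1 is a greedy walk along the cheapest arc.
class ToySuggester {
    static class Node {
        TreeMap<Character, Node> children = new TreeMap<>();
        Long wordWeight = null;         // non-null if a word ends here
        long minBelow = Long.MAX_VALUE; // min weight of any word at/under this node
    }

    final Node root = new Node();

    void add(String word, long weight) {
        Node n = root;
        n.minBelow = Math.min(n.minBelow, weight);
        for (char c : word.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
            n.minBelow = Math.min(n.minBelow, weight);
        }
        n.wordWeight = weight;
    }

    // Best completion of the prefix: follow the arc with the smallest minBelow.
    String best(String prefix) {
        Node n = root;
        for (char c : prefix.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return null; // prefix not in the automaton
        }
        StringBuilder sb = new StringBuilder(prefix);
        while (true) {
            // stop if the word ending here is already the cheapest completion
            if (n.wordWeight != null && n.wordWeight.longValue() == n.minBelow) {
                return sb.toString();
            }
            Node next = null;
            char nextLabel = 0;
            for (Map.Entry<Character, Node> e : n.children.entrySet()) {
                if (next == null || e.getValue().minBelow < next.minBelow) {
                    next = e.getValue();
                    nextLabel = e.getKey();
                }
            }
            if (next == null) return sb.toString(); // dead end
            sb.append(nextLabel);
            n = next;
        }
    }
}
```

With weights as costs (lower is better), no quantizing happens anywhere: the exact weight survives the traversal.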

      1. LUCENE-3714.patch
        38 kB
        Robert Muir
      2. LUCENE-3714.patch
        28 kB
        Robert Muir
      3. LUCENE-3714.patch
        31 kB
        Robert Muir
      4. LUCENE-3714.patch
        31 kB
        Robert Muir
      5. LUCENE-3714.patch
        18 kB
        Michael McCandless
      6. LUCENE-3714.patch
        7 kB
        Robert Muir
      7. out.png
        63 kB
        Dawid Weiss
      8. TestMe.java
        4 kB
        Dawid Weiss

        Issue Links

          Activity

          Robert Muir added a comment -

          patch that Mike and I came up with that finds the minimal output from an arc, and a random test showing it works.

          Dawid Weiss added a comment -

          A problematic example where root arcs, when traversed min-to-max, collect outputs, but every outgoing arc only collects a single better suggestion (and should skip possibly lots of other suggestions). It is created by the following input:

          aa|N
          ab|1
          ba|N
          bb|2
          ca|N
          cb|3
          ..

          Collecting the K-th suggestion with the smallest score will, pessimistically, require scanning all of the arcs. Note that you can put arbitrarily large subtrees on the _a|N nodes, like:

          aaa|N
          aab|N
          aac|N

          etc.
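A hypothetical generator for this adversarial input family (my own sketch; the method and class names are made up and this is not the code in the attached TestMe.java) could look like:

```java
import java.util.*;

// Hypothetical generator for the adversarial input described above: every
// root arc leads to one expensive subtree ("?a|N") next to exactly one
// competitive entry ("?b|i"), so a naive top-K scan must touch every root arc.
class BadCase {
    static List<String> generate(int numRootArcs, long n) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < numRootArcs; i++) {
            char c = (char) ('a' + i);
            lines.add(c + "a|" + n);       // expensive subtree
            lines.add(c + "b|" + (i + 1)); // the one competitive entry
        }
        return lines;
    }
}
```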

          Dawid Weiss added a comment -

          The patch works because it finds the first (topmost) suggestion, but collecting suggestions with max-N (or min-N) will require a priority queue so that one knows which arc to follow next (and this will also require storing partially collected paths for pointers in the fst/queue)?

          Robert Muir added a comment -

          Not sure it requires one, http://www.cs.nyu.edu/~mohri/pub/nbest.ps has some solutions.

          Dawid Weiss added a comment -

          I'm sure there are solutions to the problem if you change algebra ops – the pq is a naive solution that would work on top of positive outputs.

          Robert Muir added a comment -

          Yeah I think we should try that first, and see how it performs.

          Michael McCandless added a comment -

          Dawid, by problematic example, you mean you think this approach is functionally correct but may not perform very well...?

          That is definitely the worst-case performance (for either top-1 or top-K on a wFST with simple PQ), but this (number of non-competitive arcs you have to scan and discard) is a constant factor on the overall complexity right?

          I think we should at least test the simple PQ on PositiveIntsOutputs wFST and see how it performs in practice. If indeed having everything "in one bucket" is too slow, we could combine the two approaches: still use buckets, but within each bucket we have a wFST (ie, use the "true" score), so we don't actually do any quantizing in the end results. Then bucketing is purely an optimization...

          Or, maybe, we could keep one bucket but sort each node's arcs by their output instead of by label. This'd mean the initial lookup-by-prefix gets slower (linear scan instead of a bin search, assuming those nodes had array'd arcs), but then producing the top-N is very fast (no wasted arcs need to be scanned). Maybe we could keep the by-label sort for nodes within depth N, and then sort by output beyond that...

          Or we could change the outputs algebra so that more "lookahead" is stored in each output so we have more guidance on which arcs are worth pursuing...

          Dawid Weiss added a comment -

          I thought you had a solution that collects top-N, but your patch selects one (best) matching solution only. I don't know how you planned to go about selecting top-N, but in my understanding (at that moment) top-N selection is not going to work via recursive scan because an output at the given level doesn't tell you much about which arcs to follow.

          I can see how this can be solved by picking the arc/direction with the "next smallest/largest" output among all arcs traversed so far but this will be more complex and I cannot provide any bounds on how large the queue can be or what the worst case lookup then is. I do have a feeling a degenerate example can be devised, but then I also have a feeling these are uncommon in practice.

          Sorting arcs by score doesn't help if you use the pq – you need to add all of them to the pq and then pick the smallest path. In a way it is like what you did, but the pq is maintaining fast access to the next-smaller-cost path. Another feeling is that the PQ can be bound to a maximum size of N? Every arc leads to at least one leaf so while traversing you'd drop those arcs that definitely would have fallen out of the first N smallest/largest weights... Yes, this could work. I'd still try to devise a degenerate example to see what the cost of maintaining the PQ can be.
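The bounding argument above can be sketched in isolation (toy code, not from any attached patch): since every queued arc leads to at least one completion, a candidate whose cost is already worse than the N best costs seen so far can never produce a top-N suggestion, so the queue of candidates can be capped at N entries using a max-heap that evicts the worst.

```java
import java.util.*;

// Sketch of a size-bounded candidate queue: keep only the N cheapest costs,
// rejecting any candidate that cannot beat the current N-th best.
class BoundedQueue {
    final PriorityQueue<Long> worstFirst = new PriorityQueue<>(Comparator.reverseOrder());
    final int n;

    BoundedQueue(int n) { this.n = n; }

    // Returns true if the cost was competitive and kept.
    boolean offer(long cost) {
        if (worstFirst.size() < n) {
            worstFirst.add(cost);
            return true;
        }
        if (cost >= worstFirst.peek()) return false; // cannot enter the top N
        worstFirst.poll(); // evict the current worst
        worstFirst.add(cost);
        return true;
    }
}
```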

          Dawid Weiss added a comment -

          If I seem inconsistent above then it's because I don't have ready-to-use answers and I'm sort of thinking out loud

          Robert Muir added a comment -

          we could combine the two approaches: still use buckets, but within each bucket we have a wFST (ie, use the "true" score), so we don't actually do any quantizing in the end results. Then bucketing is purely an optimization...

          I like this idea!

          Dawid Weiss added a comment -

          If my feeling is right and the PQ can be kept constant-size then it won't matter much at runtime I think. With realistic data distributions the number of elements to be inserted into the PQ before you reach the top-N will be pretty much the same. And the benefit would be a much cleaner traversal (no need to deal with buckets, early termination, etc.).

          Michael McCandless added a comment -

          Patch, generalizing to a topN search from the wFST (not just top 1). I fixed the random test to randomly pick a topN and it's passing!

          I think the cost is something like O(N * L) + O(N * log(N)), where N is the top N and L is the length of each completion. The N * log(N) is because I use a TreeSet (PQ wasn't enough: I needed pollFirst and pollLast... for 3.x that'll have to be 2 method calls...)... but I suspect in practice it won't dominate, since N is typically smaller than L and the constant in front of that is tiny...

          Each path requires a traversal through the FST looking for the arcs with NO_OUTPUT, so if the FST has many non-competitive arcs that will definitely slow it down (by a [possibly large] constant factor). We need to test on a real data set how slow it is....
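A self-contained sketch of this kind of best-first top-N search (a toy trie standing in for the wFST; the real patch works on Lucene's FST): partial paths are expanded in order of a lower bound on their completion cost, so completed words pop off the queue in true score order and the search stops after N of them.

```java
import java.util.*;

// Best-first top-N over a weight-annotated trie. A path's priority is a lower
// bound on the cost of any completion below it (minBelow), so the first N
// completed words polled from the queue are exactly the true top N.
class TopNSuggester {
    static class Node {
        TreeMap<Character, Node> children = new TreeMap<>();
        Long wordWeight = null;
        long minBelow = Long.MAX_VALUE;
    }

    final Node root = new Node();

    void add(String word, long weight) {
        Node n = root;
        n.minBelow = Math.min(n.minBelow, weight);
        for (char c : word.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
            n.minBelow = Math.min(n.minBelow, weight);
        }
        n.wordWeight = weight;
    }

    record Path(long bound, String word, Node node) {} // node == null => completed word

    List<String> topN(String prefix, int n) {
        Node start = root;
        for (char c : prefix.toCharArray()) {
            start = start.children.get(c);
            if (start == null) return List.of();
        }
        PriorityQueue<Path> pq = new PriorityQueue<>(
            Comparator.comparingLong(Path::bound).thenComparing(Path::word));
        pq.add(new Path(start.minBelow, prefix, start));
        List<String> results = new ArrayList<>();
        while (!pq.isEmpty() && results.size() < n) {
            Path p = pq.poll();
            if (p.node() == null) {        // a completed word: emit it
                results.add(p.word());
                continue;
            }
            if (p.node().wordWeight != null) {
                pq.add(new Path(p.node().wordWeight, p.word(), null));
            }
            for (Map.Entry<Character, Node> e : p.node().children.entrySet()) {
                pq.add(new Path(e.getValue().minBelow, p.word() + e.getKey(), e.getValue()));
            }
        }
        return results;
    }
}
```

This is the naive PQ version: it pushes an entry per arc, whereas the patch avoids most of those pushes by greedily chasing NO_OUTPUT arcs and only falling back to the queue between top-N candidates.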

          Dawid Weiss added a comment -

          an example showing top-n with score and alpha order, pq (unbounded, but going for bounded should be simple).

          Dawid Weiss added a comment -

          Damn, again slower than McCandless... I didn't do a bounded queue, but it's perfectly possible to do one. My patch just shows the algorithm; I didn't check if Mike's version is the same, but I suspect it must be close.

          Michael McCandless added a comment -

          Ooh, your patch is doing the same algo as mine (best-first search)... yours is much simpler/smaller though

          I think your tie break isn't quite right? Because, those two arc.labels you use aren't necessarily "comparable", since the two paths may not be the same length (hmm or share the same "last parent" node)? You need to roll back and do a full compare of the accumulated input labels down that path? I struggled with this too...

          I also took advantage of a neat property that every min/+ wFST will have (Robert pointed this out): the first arc (only) will have the min output, and then there must exist a NO_OUTPUT completion path from that arc to some final node. So, I think my patch will visit paths in the same best-first order as yours, but I avoid the push/pop/new object into the queue for each arc traversed by greedily/recursively pursuing the first NO_OUTPUT arc I can find. I only fall back to the queue when I need to find the next top-N path to pursue...

          Dawid Weiss added a comment -

          You're correct – I should be comparing full paths so far, not the current label. Otherwise I see we're pretty much the same. I like the lower bound cutoff condition. I vaguely understand the NO_OUTPUT optimization.

          I like it, this indeed is a nice improvement - if somebody wants exact scores, they can have them.

          Robert Muir added a comment -

          patch with a prototype suggester.

          there are many nocommits: especially the encoding of weights is very inefficient and slows/bloats the fst (but easy for a prototype).

          We should make a better Outputs class with the algebra we need (e.g. max) and maybe think about floats in this suggester API (it seems from the API names like TermFreq that these are really frequencies so maybe we should define them as ints? I don't like floats because of all the hassles like NaN/inf)

          But even with this, the performance seems promising:

              [junit] ------------- Standard Error -----------------
              [junit] -- prefixes: 100-200, num: 7, onlyMorePopular: true
              [junit] JaspellLookup   queries: 50001, time[ms]: 91 [+- 6.59], ~qps: 547
              [junit] TSTLookup       queries: 50001, time[ms]: 33 [+- 1.55], ~qps: 1532
              [junit] FSTCompletionLookup queries: 50001, time[ms]: 163 [+- 13.04], ~qps: 307
              [junit] WFSTCompletionLookup queries: 50001, time[ms]: 55 [+- 1.50], ~qps: 910
              [junit] -- construction time
              [junit] JaspellLookup   input: 50001, time[ms]: 23 [+- 0.93]
              [junit] TSTLookup       input: 50001, time[ms]: 31 [+- 1.25]
              [junit] FSTCompletionLookup input: 50001, time[ms]: 72 [+- 1.73]
              [junit] WFSTCompletionLookup input: 50001, time[ms]: 69 [+- 2.50]
              [junit] -- RAM consumption
              [junit] JaspellLookup   size[B]:    7,869,415
              [junit] TSTLookup       size[B]:    7,914,524
              [junit] FSTCompletionLookup size[B]:      466,051
              [junit] WFSTCompletionLookup size[B]:      506,662
              [junit] -- prefixes: 6-9, num: 7, onlyMorePopular: true
              [junit] JaspellLookup   queries: 50001, time[ms]: 129 [+- 1.37], ~qps: 389
              [junit] TSTLookup       queries: 50001, time[ms]: 139 [+- 2.30], ~qps: 361
              [junit] FSTCompletionLookup queries: 50001, time[ms]: 177 [+- 1.62], ~qps: 282
              [junit] WFSTCompletionLookup queries: 50001, time[ms]: 85 [+- 12.74], ~qps: 592
              [junit] -- prefixes: 2-4, num: 7, onlyMorePopular: true
              [junit] JaspellLookup   queries: 50001, time[ms]: 363 [+- 16.75], ~qps: 138
              [junit] TSTLookup       queries: 50001, time[ms]: 1112 [+- 22.57], ~qps: 45
              [junit] FSTCompletionLookup queries: 50001, time[ms]: 140 [+- 3.16], ~qps: 356
              [junit] WFSTCompletionLookup queries: 50001, time[ms]: 242 [+- 4.01], ~qps: 207
              [junit] ------------- ---------------- ---------------
          
          Dawid Weiss added a comment -

          Didn't look at the patch yet but this looks surprising:

          FSTCompletionLookup queries: 50001, time[ms]: 163 [+- 13.04], ~qps: 307
          WFSTCompletionLookup queries: 50001, time[ms]: 55 [+- 1.50], ~qps: 910
          

          Why would the previous version be so much slower? With onlyMorePopular=true this should be a single pass. Unless it's setup to bring exact match to the top and then there's a need to check multiple buckets with lower scores – would that be it?

          Robert Muir added a comment -

          That's the exact match case of the benchmark, where it benchmarks the entire word being typed in completely?

          I didn't look at it because it's not very interesting (TSTLookup looks super-fast here too)

          I think this is the interesting part:

              [junit] -- prefixes: 2-4, num: 7, onlyMorePopular: true
              [junit] JaspellLookup   queries: 50001, time[ms]: 363 [+- 16.75], ~qps: 138
              [junit] TSTLookup       queries: 50001, time[ms]: 1112 [+- 22.57], ~qps: 45
              [junit] FSTCompletionLookup queries: 50001, time[ms]: 140 [+- 3.16], ~qps: 356
              [junit] WFSTCompletionLookup queries: 50001, time[ms]: 242 [+- 4.01], ~qps: 207
          

          We could of course toggle 'num' etc and see how this comes out. Anyway I just wanted
          to make sure the performance was 'in the ballpark' and competitive with the other
          suggesters; I think for a lot of users that's all that matters, and whether it's
          400 QPS or 200 QPS probably doesn't matter so much... so I didn't tweak any further.

          Michael McCandless added a comment -

          Awesome! I haven't looked at patch yet... but the comparison is also apples/oranges right?

          Because WFSTCompletionLookup does no quantizing (bucketing), ie, it returns the "true" topN suggestions, so it has to do more work to differentiate hits that FSTCompletionLookup considers equal.

          I guess we could pre-quantize the weights into the same buckets that FSTCompletionLookup will use, when adding to WFSTCompletionLookup... then the results are comparable.

          But I guess we don't need to do this since the results are "good enough"...

          Dawid Weiss added a comment -

          No, no, Mike – I'm all for going with exact, fine-grained scores. That way even if WFSTCompletionLookup is slower it's still a real-life use case, and FSTCompletionLookup (even if faster) will carry the caveat that it's not a complete solution.

          I like what Robert did (still didn't look at the patch), but I was just wondering why there is such a difference for long prefixes. This is not a frequent case in real life, just curiosity.

          Michael McCandless added a comment -

          Sorry, I'm not proposing we commit pre-quantizing the scores... I'm
          just saying we'd learn more from the perf test that way. Ie, is WFST
          slower because 1) it's doing precise scores, or 2) the topN algo is
          slowish.

          Robert Muir added a comment -

          I think the (existing) benchmark is fair: it just measures how fast each suggester returns the top-N.
          Of course FSTSuggester buckets/early-terminates, and that's just a tradeoff it makes (having some impact on scores, possibly even a positive one).

          This is just like benchmarking different relevance ranking algorithms, which also compute the top-N differently...

          Keep in mind, I didn't optimize the implementation at all or profile anything, so there could be easy wins here.
          But at this stage I think we just wanted to know it's in the ballpark?

          Dawid Weiss added a comment -

          Looked through the patch. Some comments:

          + // nocommit: why does the benchmark test supply duplicates? what should we do in this case?

          Ignore duplicates. I think the iterator is allowed to supply duplicates; I'm not sure, but this may even happen when bucketing input – the same entry with a different score (before bucketing) may end up identical after sorting.

          + // nocommit: shouldn't we have an option to
          + // match exactly even if its a crappy suggestion? we would just look
          + // for arc.isFinal here etc...

          Yes, this should definitely be an option because if it's an exact match then you'll probably want it on top of the suggestions list, no matter what.

          You could also add a generator of the "bad case" that I attached inside TestMe.java – this creates the case when following greedily doesn't yield correct output (requires a pq).

          I also checked the benchmark and yes, it uses exactMatchFirst promotion. This clarifies the speed difference for longer prefixes – not enough results are collected in any of the buckets so no early termination occurs and all buckets must be traversed.

          I like WFSTCompletionLookup very much, clean and simple.

          Robert Muir added a comment -

          Dawid thanks for the comments: we can remove that first nocommit and then add an option
          for the second one...

          I think as a step to move this forward we have to fix the output encoding to
          not be (Integer.MAX_VALUE-weight).

          Seems like the best first step is to generify findMinPairs to T and to allow an arbitrary Comparator
          so that we can muck around with the algebra? I looked at this and it seems possible...
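For reference, the encoding being discussed can be sketched as follows (a simplified illustration with invented method names, not the patch's code): PositiveIntOutputs holds non-negative longs and the shortest-path traversal minimizes the summed output, so the highest weight must map to the smallest cost. Subtracting from Integer.MAX_VALUE does exactly that for non-negative int weights.

```java
// Sketch of the order-reversing encoding: min-cost order equals max-weight
// order, and the result stays non-negative for weights in [0, Integer.MAX_VALUE].
public class WeightEncoding {
    static long encode(int weight) {
        return (long) Integer.MAX_VALUE - weight; // best weight -> cost 0
    }

    static int decode(long cost) {
        return (int) (Integer.MAX_VALUE - cost);
    }

    public static void main(String[] args) {
        System.out.println(encode(Integer.MAX_VALUE)); // 0
    }
}
```

The downside, as noted above, is that it bakes a fixed precision and algebra into the outputs, which is what generifying to a Comparator would loosen.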

          Robert Muir added a comment -

          also I'm not sure I'm in love with "findMinPairs". Maybe we should call it "shortestPaths" ?

          Michael McCandless added a comment -

          I think the (existing) benchmark is fair: it just measures how fast each suggester returns the top-N.
          Of course FSTSuggester buckets/early terminates and that's just a tradeoff it makes (having some impact on scores, possibly even positive).

          OK, I agree with that: it is a meaningful black-box comparison of two suggester impls.

          also I'm not sure I'm in love with "findMinPairs". Maybe we should call it "shortestPaths" ?

          +1

          Robert Muir added a comment -

          updated patch nuking a couple nocommits: added the boolean exactMatchFirst (default=on like FSTSuggester), and removed the deduping nocommit.

          So the main remaining things are:

          • fix the encoding of the weights
          • use the offline sort
          • figure out what onlyMorePopular means for suggesters and what we should do about it
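The exactMatchFirst behavior, reduced to a list operation for illustration (the real code works directly on the FST during lookup; the names here are mine):

```java
import java.util.*;

// Sketch: if the typed key is itself a complete entry, promote it to the top
// of the ranked suggestions regardless of its weight.
public class ExactFirst {
    static List<String> promoteExact(String key, List<String> ranked) {
        List<String> out = new ArrayList<>(ranked);
        if (out.remove(key)) {
            out.add(0, key); // exact match goes first, rest keep their order
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(promoteExact("foo",
            Arrays.asList("food", "foo", "fool"))); // [foo, food, fool]
    }
}
```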
          Robert Muir added a comment -

          I've been wanting to work on this... haven't found the time.

          This just syncs the patch up to trunk's FST API changes.

          Robert Muir added a comment -

          I played with a lot of variations on this patch:

          • generified the shortest path method to take things like Comparators/Comparable
          • tried different algebra/representations with that

          I think we should keep the code wired to Long for now. according to the benchmark
          any generification is like a 5-10% overall perf hit, and I don't see a need for
          anything but Long.

          I think as far as representation, we need to integrate the offline sort, find min/max
          float values and scale to <precision> space, e.g. if precision is 32 then
          Integer.MAX_VALUE - scaledWeight.

          I tried different representations and they just add more complexity (e.g. negative outputs),
          without saving much space at all. This patch uses Integer precision and is only 10% larger
          than the previous impl.

          We don't even need precision to be configurable really,
          we could wire it to integers as a start. But maybe later someone could specify it, e.g.
          if they specified 8 then they basically get the same result as bucketed algorithm today...
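That scaling step might look roughly like this (a hedged sketch with invented names; it uses 2^precision - 1 as the top of the range for simplicity, whereas the comment above describes Integer.MAX_VALUE for precision 32):

```java
// Sketch: map raw float weights into <precision>-bit integer space, then
// invert so the best weight gets the smallest cost (0).
public class WeightScaling {
    static long scale(float weight, float min, float max, int precision) {
        long maxVal = (1L << precision) - 1; // e.g. precision=8 -> 255
        if (max == min) {
            return 0; // all weights equal: every entry gets the best cost
        }
        double frac = (weight - min) / (double) (max - min); // 0.0 .. 1.0
        long scaled = Math.round(frac * maxVal);
        return maxVal - scaled;
    }

    public static void main(String[] args) {
        System.out.println(scale(5.0f, 1.0f, 5.0f, 8)); // 0
    }
}
```

With precision 8 every weight collapses into one of 256 values, which is essentially the bucketed algorithm; with precision 32 the full granularity is kept.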

          Robert Muir added a comment -

          Rounded out this patch: fixed it to use the offline sorting, added a random test (basically a port of the FST shortest-path test, but uses the suggester instead), and added a solr factory and some docs.

          I think it's ready to go.

          Robert Muir added a comment -

          I'll commit this soon if there are no objections.

          I'll mark it experimental for now when committing.

          Robert Muir added a comment -

          Committed to trunk: working on the backport now.

          I'm gonna measure any perf difference from the pollFirst/pollLast stuff... if there is a slowdown
          we can deal with that on a separate issue (maybe there is some way to prevent a slowdown, at least
          if Java 6 is actually what's being used)... but maybe it doesn't affect overall suggest speed anyway.

          Robert Muir added a comment -

          I ran the benchmarks, no measurable difference, makes it easy. Will commit
          the backport soon after ant test/javadocs with java5 are finished.

          Robert Muir added a comment -

          Backported to 3.x too, I think we can iterate from here with other issues!


            People

            • Assignee:
              Robert Muir
            • Reporter:
              Robert Muir
            • Votes: 0
            • Watchers: 2
