Lucene - Core
  1. Lucene - Core
  2. LUCENE-4481

AnalyzingSuggester may fail to return correct topN suggestions

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, Trunk
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I hit this when working on LUCENE-4480.

      Because AnalyzingSuggester may prune some of the topN paths found by FST's Util.TopNSearcher, this means the queue size limit of topN makes the overall search inadmissible, ie it may incorrectly prune paths that would have lead to a competitive path.

      However, such pruning is rare: it happens only for graph token streams, and even then only when competitive analyzed forms share the same surface forms.

      The simplest way to fix this is to make the queue unbounded but this is likely a sizable performance hit ... I haven't tested yet. It's even possible the way the dups happen (always at the "end" of the suggestion, because we tack on 0 byte followed by ord dedup byte) prevent this bug from even occurring and so this could all be a false alarm! I have to try to make a test case showing it ...

      A cop-out solution would be to expose a separate queueSize or queueMultiplier (over the topN) so that if users are affected by this they could crank up the queue size or multiplier.

      1. LUCENE-4481.patch
        2 kB
        Michael McCandless
      2. LUCENE-4481.patch
        5 kB
        Michael McCandless
      3. LUCENE-4481.patch
        2 kB
        Michael McCandless
      4. LUCENE-4481.patch
        12 kB
        Michael McCandless
      5. LUCENE-4481.patch
        10 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Patch, fixing one of the cases, where exactFirst=true may cause one too few paths to be returned.

        Show
        Michael McCandless added a comment - Patch, fixing one of the cases, where exactFirst=true may cause one too few paths to be returned.
        Hide
        Michael McCandless added a comment -

        There are still two other cases:

        • Duplicate paths (where same surface form had more than one analyzed form)
        • An opto that sets queue to null for the last path

        I'll first get test cases showing these bugs ...

        Show
        Michael McCandless added a comment - There are still two other cases: Duplicate paths (where same surface form had more than one analyzed form) An opto that sets queue to null for the last path I'll first get test cases showing these bugs ...
        Hide
        Michael McCandless added a comment -

        New patch w/ test case showing the 2nd bug (setting queue to null for last path). To fix this I just commented out that nulling (it's just an opto) and left a TODO to re-instate it if we can make it admissible.

        Show
        Michael McCandless added a comment - New patch w/ test case showing the 2nd bug (setting queue to null for last path). To fix this I just commented out that nulling (it's just an opto) and left a TODO to re-instate it if we can make it admissible.
        Hide
        Michael McCandless added a comment -

        New patch, with test for the third bug and simplest fix (unbounded
        queue). All tests pass ... so this is the starting point, and we can
        now (separately) try to add back the optimizations.

        Here's LookupBenchmarkTest with the patch:

        [junit4:junit4] Suite: org.apache.lucene.search.suggest.LookupBenchmarkTest
        [junit4:junit4]   2> -- construction time
        [junit4:junit4]   2> JaspellLookup   input: 50001, time[ms]: 23 [+- 4.29]
        [junit4:junit4]   2> TSTLookup       input: 50001, time[ms]: 71 [+- 5.10]
        [junit4:junit4]   2> FSTCompletionLookup input: 50001, time[ms]: 121 [+- 7.23]
        [junit4:junit4]   2> WFSTCompletionLookup input: 50001, time[ms]: 89 [+- 4.90]
        [junit4:junit4]   2> AnalyzingSuggester input: 50001, time[ms]: 237 [+- 27.55]
        [junit4:junit4] OK      11.9s | LookupBenchmarkTest.testConstructionTime
        [junit4:junit4]   2> -- prefixes: 2-4, num: 7, onlyMorePopular: true
        [junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 271 [+- 3.83], ~kQPS: 185
        [junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 732 [+- 8.97], ~kQPS: 68
        [junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 121 [+- 4.34], ~kQPS: 413
        [junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 338 [+- 4.64], ~kQPS: 148
        [junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 791 [+- 7.66], ~kQPS: 63
        [junit4:junit4] OK      46.1s | LookupBenchmarkTest.testPerformanceOnPrefixes2_4
        [junit4:junit4]   2> -- prefixes: 6-9, num: 7, onlyMorePopular: true
        [junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 101 [+- 3.26], ~kQPS: 496
        [junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 88 [+- 2.78], ~kQPS: 568
        [junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 151 [+- 3.36], ~kQPS: 332
        [junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 77 [+- 3.45], ~kQPS: 646
        [junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 272 [+- 3.87], ~kQPS: 184
        [junit4:junit4] OK      14.4s | LookupBenchmarkTest.testPerformanceOnPrefixes6_9
        [junit4:junit4]   2> -- RAM consumption
        [junit4:junit4]   2> JaspellLookup   size[B]:    9,815,152
        [junit4:junit4]   2> TSTLookup       size[B]:    9,858,792
        [junit4:junit4]   2> FSTCompletionLookup size[B]:      466,520
        [junit4:junit4]   2> WFSTCompletionLookup size[B]:      507,640
        [junit4:junit4]   2> AnalyzingSuggester size[B]:      889,138
        [junit4:junit4] OK      0.74s | LookupBenchmarkTest.testStorageNeeds
        [junit4:junit4]   2> -- prefixes: 100-200, num: 7, onlyMorePopular: true
        [junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 71 [+- 3.14], ~kQPS: 702
        [junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 32 [+- 0.74], ~kQPS: 1561
        [junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 145 [+- 3.60], ~kQPS: 344
        [junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 49 [+- 4.97], ~kQPS: 1029
        [junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 235 [+- 3.52], ~kQPS: 212
        [junit4:junit4] OK      11.3s | LookupBenchmarkTest.testPerformanceOnFullHits
        [junit4:junit4] Completed in 84.81s, 5 tests
        

        And on trunk:

        [junit4:junit4] <JUnit4> says olá! Master seed: 827F8DD5C0F3472D
        [junit4:junit4] Executing 1 suite with 1 JVM.
        [junit4:junit4] 
        [junit4:junit4] Suite: org.apache.lucene.search.suggest.LookupBenchmarkTest
        [junit4:junit4]   2> -- prefixes: 6-9, num: 7, onlyMorePopular: true
        [junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 114 [+- 2.83], ~kQPS: 439
        [junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 66 [+- 2.06], ~kQPS: 762
        [junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 138 [+- 2.13], ~kQPS: 362
        [junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 69 [+- 4.75], ~kQPS: 725
        [junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 260 [+- 5.00], ~kQPS: 192
        [junit4:junit4] OK      15.0s | LookupBenchmarkTest.testPerformanceOnPrefixes6_9
        [junit4:junit4]   2> -- construction time
        [junit4:junit4]   2> JaspellLookup   input: 50001, time[ms]: 22 [+- 3.19]
        [junit4:junit4]   2> TSTLookup       input: 50001, time[ms]: 64 [+- 2.32]
        [junit4:junit4]   2> FSTCompletionLookup input: 50001, time[ms]: 120 [+- 2.99]
        [junit4:junit4]   2> WFSTCompletionLookup input: 50001, time[ms]: 86 [+- 1.40]
        [junit4:junit4]   2> AnalyzingSuggester input: 50001, time[ms]: 232 [+- 3.98]
        [junit4:junit4] OK      10.7s | LookupBenchmarkTest.testConstructionTime
        [junit4:junit4]   2> -- prefixes: 100-200, num: 7, onlyMorePopular: true
        [junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 72 [+- 2.92], ~kQPS: 694
        [junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 32 [+- 3.12], ~kQPS: 1556
        [junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 140 [+- 1.21], ~kQPS: 356
        [junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 45 [+- 1.74], ~kQPS: 1102
        [junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 233 [+- 9.29], ~kQPS: 215
        [junit4:junit4] OK      11.0s | LookupBenchmarkTest.testPerformanceOnFullHits
        [junit4:junit4]   2> -- prefixes: 2-4, num: 7, onlyMorePopular: true
        [junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 257 [+- 3.21], ~kQPS: 194
        [junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 510 [+- 5.35], ~kQPS: 98
        [junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 119 [+- 3.17], ~kQPS: 421
        [junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 240 [+- 5.40], ~kQPS: 208
        [junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 595 [+- 8.07], ~kQPS: 84
        [junit4:junit4] OK      35.1s | LookupBenchmarkTest.testPerformanceOnPrefixes2_4
        [junit4:junit4]   2> -- RAM consumption
        [junit4:junit4]   2> JaspellLookup   size[B]:    9,815,152
        [junit4:junit4]   2> TSTLookup       size[B]:    9,858,792
        [junit4:junit4]   2> FSTCompletionLookup size[B]:      466,520
        [junit4:junit4]   2> WFSTCompletionLookup size[B]:      507,640
        [junit4:junit4]   2> AnalyzingSuggester size[B]:      889,138
        [junit4:junit4] OK      0.86s | LookupBenchmarkTest.testStorageNeeds
        [junit4:junit4] Completed in 72.97s, 5 tests
        

        So the lookup is definitely slower .. WFSTCompletionLookup is most
        heavily affected.

        Show
        Michael McCandless added a comment - New patch, with test for the third bug and simplest fix (unbounded queue). All tests pass ... so this is the starting point, and we can now (separately) try to add back the optimizations. Here's LookupBenchmarkTest with the patch: [junit4:junit4] Suite: org.apache.lucene.search.suggest.LookupBenchmarkTest [junit4:junit4] 2> -- construction time [junit4:junit4] 2> JaspellLookup input: 50001, time[ms]: 23 [+- 4.29] [junit4:junit4] 2> TSTLookup input: 50001, time[ms]: 71 [+- 5.10] [junit4:junit4] 2> FSTCompletionLookup input: 50001, time[ms]: 121 [+- 7.23] [junit4:junit4] 2> WFSTCompletionLookup input: 50001, time[ms]: 89 [+- 4.90] [junit4:junit4] 2> AnalyzingSuggester input: 50001, time[ms]: 237 [+- 27.55] [junit4:junit4] OK 11.9s | LookupBenchmarkTest.testConstructionTime [junit4:junit4] 2> -- prefixes: 2-4, num: 7, onlyMorePopular: true [junit4:junit4] 2> JaspellLookup queries: 50001, time[ms]: 271 [+- 3.83], ~kQPS: 185 [junit4:junit4] 2> TSTLookup queries: 50001, time[ms]: 732 [+- 8.97], ~kQPS: 68 [junit4:junit4] 2> FSTCompletionLookup queries: 50001, time[ms]: 121 [+- 4.34], ~kQPS: 413 [junit4:junit4] 2> WFSTCompletionLookup queries: 50001, time[ms]: 338 [+- 4.64], ~kQPS: 148 [junit4:junit4] 2> AnalyzingSuggester queries: 50001, time[ms]: 791 [+- 7.66], ~kQPS: 63 [junit4:junit4] OK 46.1s | LookupBenchmarkTest.testPerformanceOnPrefixes2_4 [junit4:junit4] 2> -- prefixes: 6-9, num: 7, onlyMorePopular: true [junit4:junit4] 2> JaspellLookup queries: 50001, time[ms]: 101 [+- 3.26], ~kQPS: 496 [junit4:junit4] 2> TSTLookup queries: 50001, time[ms]: 88 [+- 2.78], ~kQPS: 568 [junit4:junit4] 2> FSTCompletionLookup queries: 50001, time[ms]: 151 [+- 3.36], ~kQPS: 332 [junit4:junit4] 2> WFSTCompletionLookup queries: 50001, time[ms]: 77 [+- 3.45], ~kQPS: 646 [junit4:junit4] 2> AnalyzingSuggester queries: 50001, time[ms]: 272 [+- 3.87], ~kQPS: 184 [junit4:junit4] OK 14.4s | LookupBenchmarkTest.testPerformanceOnPrefixes6_9 [junit4:junit4] 2> -- RAM consumption [junit4:junit4] 2> JaspellLookup size[B]: 9,815,152 [junit4:junit4] 2> TSTLookup size[B]: 9,858,792 [junit4:junit4] 2> FSTCompletionLookup size[B]: 466,520 [junit4:junit4] 2> WFSTCompletionLookup size[B]: 507,640 [junit4:junit4] 2> AnalyzingSuggester size[B]: 889,138 [junit4:junit4] OK 0.74s | LookupBenchmarkTest.testStorageNeeds [junit4:junit4] 2> -- prefixes: 100-200, num: 7, onlyMorePopular: true [junit4:junit4] 2> JaspellLookup queries: 50001, time[ms]: 71 [+- 3.14], ~kQPS: 702 [junit4:junit4] 2> TSTLookup queries: 50001, time[ms]: 32 [+- 0.74], ~kQPS: 1561 [junit4:junit4] 2> FSTCompletionLookup queries: 50001, time[ms]: 145 [+- 3.60], ~kQPS: 344 [junit4:junit4] 2> WFSTCompletionLookup queries: 50001, time[ms]: 49 [+- 4.97], ~kQPS: 1029 [junit4:junit4] 2> AnalyzingSuggester queries: 50001, time[ms]: 235 [+- 3.52], ~kQPS: 212 [junit4:junit4] OK 11.3s | LookupBenchmarkTest.testPerformanceOnFullHits [junit4:junit4] Completed in 84.81s, 5 tests And on trunk: [junit4:junit4] <JUnit4> says olá! Master seed: 827F8DD5C0F3472D [junit4:junit4] Executing 1 suite with 1 JVM. [junit4:junit4] [junit4:junit4] Suite: org.apache.lucene.search.suggest.LookupBenchmarkTest [junit4:junit4] 2> -- prefixes: 6-9, num: 7, onlyMorePopular: true [junit4:junit4] 2> JaspellLookup queries: 50001, time[ms]: 114 [+- 2.83], ~kQPS: 439 [junit4:junit4] 2> TSTLookup queries: 50001, time[ms]: 66 [+- 2.06], ~kQPS: 762 [junit4:junit4] 2> FSTCompletionLookup queries: 50001, time[ms]: 138 [+- 2.13], ~kQPS: 362 [junit4:junit4] 2> WFSTCompletionLookup queries: 50001, time[ms]: 69 [+- 4.75], ~kQPS: 725 [junit4:junit4] 2> AnalyzingSuggester queries: 50001, time[ms]: 260 [+- 5.00], ~kQPS: 192 [junit4:junit4] OK 15.0s | LookupBenchmarkTest.testPerformanceOnPrefixes6_9 [junit4:junit4] 2> -- construction time [junit4:junit4] 2> JaspellLookup input: 50001, time[ms]: 22 [+- 3.19] [junit4:junit4] 2> TSTLookup input: 50001, time[ms]: 64 [+- 2.32] [junit4:junit4] 2> FSTCompletionLookup input: 50001, time[ms]: 120 [+- 2.99] [junit4:junit4] 2> WFSTCompletionLookup input: 50001, time[ms]: 86 [+- 1.40] [junit4:junit4] 2> AnalyzingSuggester input: 50001, time[ms]: 232 [+- 3.98] [junit4:junit4] OK 10.7s | LookupBenchmarkTest.testConstructionTime [junit4:junit4] 2> -- prefixes: 100-200, num: 7, onlyMorePopular: true [junit4:junit4] 2> JaspellLookup queries: 50001, time[ms]: 72 [+- 2.92], ~kQPS: 694 [junit4:junit4] 2> TSTLookup queries: 50001, time[ms]: 32 [+- 3.12], ~kQPS: 1556 [junit4:junit4] 2> FSTCompletionLookup queries: 50001, time[ms]: 140 [+- 1.21], ~kQPS: 356 [junit4:junit4] 2> WFSTCompletionLookup queries: 50001, time[ms]: 45 [+- 1.74], ~kQPS: 1102 [junit4:junit4] 2> AnalyzingSuggester queries: 50001, time[ms]: 233 [+- 9.29], ~kQPS: 215 [junit4:junit4] OK 11.0s | LookupBenchmarkTest.testPerformanceOnFullHits [junit4:junit4] 2> -- prefixes: 2-4, num: 7, onlyMorePopular: true [junit4:junit4] 2> JaspellLookup queries: 50001, time[ms]: 257 [+- 3.21], ~kQPS: 194 [junit4:junit4] 2> TSTLookup queries: 50001, time[ms]: 510 [+- 5.35], ~kQPS: 98 [junit4:junit4] 2> FSTCompletionLookup queries: 50001, time[ms]: 119 [+- 3.17], ~kQPS: 421 [junit4:junit4] 2> WFSTCompletionLookup queries: 50001, time[ms]: 240 [+- 5.40], ~kQPS: 208 [junit4:junit4] 2> AnalyzingSuggester queries: 50001, time[ms]: 595 [+- 8.07], ~kQPS: 84 [junit4:junit4] OK 35.1s | LookupBenchmarkTest.testPerformanceOnPrefixes2_4 [junit4:junit4] 2> -- RAM consumption [junit4:junit4] 2> JaspellLookup size[B]: 9,815,152 [junit4:junit4] 2> TSTLookup size[B]: 9,858,792 [junit4:junit4] 2> FSTCompletionLookup size[B]: 466,520 [junit4:junit4] 2> WFSTCompletionLookup size[B]: 507,640 [junit4:junit4] 2> AnalyzingSuggester size[B]: 889,138 [junit4:junit4] OK 0.86s | LookupBenchmarkTest.testStorageNeeds [junit4:junit4] Completed in 72.97s, 5 tests So the lookup is definitely slower .. WFSTCompletionLookup is most heavily affected.
        Hide
        Robert Muir added a comment -

        Though this is a pretty big hit (i look at prefixes 2-4), lets commit the fix
        for the bug first and then go back around to optimizations.

        Show
        Robert Muir added a comment - Though this is a pretty big hit (i look at prefixes 2-4), lets commit the fix for the bug first and then go back around to optimizations.
        Hide
        Michael McCandless added a comment -

        OK, new patch, this time adding back some optos:

        [junit4:junit4] Suite: org.apache.lucene.search.suggest.LookupBenchmarkTest
        [junit4:junit4]   2> -- RAM consumption
        [junit4:junit4]   2> JaspellLookup   size[B]:    9,815,152
        [junit4:junit4]   2> TSTLookup       size[B]:    9,858,792
        [junit4:junit4]   2> FSTCompletionLookup size[B]:      466,520
        [junit4:junit4]   2> WFSTCompletionLookup size[B]:      507,640
        [junit4:junit4]   2> AnalyzingSuggester size[B]:      889,138
        [junit4:junit4] OK      1.67s | LookupBenchmarkTest.testStorageNeeds
        [junit4:junit4]   2> -- prefixes: 6-9, num: 7, onlyMorePopular: true
        [junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 108 [+- 8.81], ~kQPS: 464
        [junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 79 [+- 1.07], ~kQPS: 631
        [junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 148 [+- 2.54], ~kQPS: 339
        [junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 67 [+- 2.78], ~kQPS: 745
        [junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 260 [+- 3.92], ~kQPS: 192
        [junit4:junit4] OK      14.6s | LookupBenchmarkTest.testPerformanceOnPrefixes6_9
        [junit4:junit4]   2> -- prefixes: 2-4, num: 7, onlyMorePopular: true
        [junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 262 [+- 5.16], ~kQPS: 191
        [junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 641 [+- 6.46], ~kQPS: 78
        [junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 118 [+- 2.95], ~kQPS: 424
        [junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 239 [+- 4.84], ~kQPS: 210
        [junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 660 [+- 7.39], ~kQPS: 76
        [junit4:junit4] OK      39.0s | LookupBenchmarkTest.testPerformanceOnPrefixes2_4
        [junit4:junit4]   2> -- construction time
        [junit4:junit4]   2> JaspellLookup   input: 50001, time[ms]: 23 [+- 4.20]
        [junit4:junit4]   2> TSTLookup       input: 50001, time[ms]: 64 [+- 2.06]
        [junit4:junit4]   2> FSTCompletionLookup input: 50001, time[ms]: 120 [+- 2.11]
        [junit4:junit4]   2> WFSTCompletionLookup input: 50001, time[ms]: 88 [+- 1.09]
        [junit4:junit4]   2> AnalyzingSuggester input: 50001, time[ms]: 245 [+- 27.85]
        [junit4:junit4] OK      10.9s | LookupBenchmarkTest.testConstructionTime
        [junit4:junit4]   2> -- prefixes: 100-200, num: 7, onlyMorePopular: true
        [junit4:junit4]   2> JaspellLookup   queries: 50001, time[ms]: 68 [+- 1.17], ~kQPS: 731
        [junit4:junit4]   2> TSTLookup       queries: 50001, time[ms]: 31 [+- 2.82], ~kQPS: 1617
        [junit4:junit4]   2> FSTCompletionLookup queries: 50001, time[ms]: 141 [+- 1.97], ~kQPS: 354
        [junit4:junit4]   2> WFSTCompletionLookup queries: 50001, time[ms]: 45 [+- 3.37], ~kQPS: 1099
        [junit4:junit4]   2> AnalyzingSuggester queries: 50001, time[ms]: 233 [+- 4.02], ~kQPS: 215
        [junit4:junit4] OK      11.1s | LookupBenchmarkTest.testPerformanceOnFullHits
        [junit4:junit4] Completed in 77.54s, 5 tests
        

        I added 2nd param (maxQueueDepth) to TopNSearcher, and fixed
        WFSTSuggester to pass topN for that (should get back most of its
        perf). I also fixed AnalyzingSuggester: we can bound how big a queue
        we need by the worst case number of analyzed forms for a single
        surface form. This is nice because if the analyzed doesn't create a
        graph then we should have close to same perf as before.

        Show
        Michael McCandless added a comment - OK, new patch, this time adding back some optos: [junit4:junit4] Suite: org.apache.lucene.search.suggest.LookupBenchmarkTest [junit4:junit4] 2> -- RAM consumption [junit4:junit4] 2> JaspellLookup size[B]: 9,815,152 [junit4:junit4] 2> TSTLookup size[B]: 9,858,792 [junit4:junit4] 2> FSTCompletionLookup size[B]: 466,520 [junit4:junit4] 2> WFSTCompletionLookup size[B]: 507,640 [junit4:junit4] 2> AnalyzingSuggester size[B]: 889,138 [junit4:junit4] OK 1.67s | LookupBenchmarkTest.testStorageNeeds [junit4:junit4] 2> -- prefixes: 6-9, num: 7, onlyMorePopular: true [junit4:junit4] 2> JaspellLookup queries: 50001, time[ms]: 108 [+- 8.81], ~kQPS: 464 [junit4:junit4] 2> TSTLookup queries: 50001, time[ms]: 79 [+- 1.07], ~kQPS: 631 [junit4:junit4] 2> FSTCompletionLookup queries: 50001, time[ms]: 148 [+- 2.54], ~kQPS: 339 [junit4:junit4] 2> WFSTCompletionLookup queries: 50001, time[ms]: 67 [+- 2.78], ~kQPS: 745 [junit4:junit4] 2> AnalyzingSuggester queries: 50001, time[ms]: 260 [+- 3.92], ~kQPS: 192 [junit4:junit4] OK 14.6s | LookupBenchmarkTest.testPerformanceOnPrefixes6_9 [junit4:junit4] 2> -- prefixes: 2-4, num: 7, onlyMorePopular: true [junit4:junit4] 2> JaspellLookup queries: 50001, time[ms]: 262 [+- 5.16], ~kQPS: 191 [junit4:junit4] 2> TSTLookup queries: 50001, time[ms]: 641 [+- 6.46], ~kQPS: 78 [junit4:junit4] 2> FSTCompletionLookup queries: 50001, time[ms]: 118 [+- 2.95], ~kQPS: 424 [junit4:junit4] 2> WFSTCompletionLookup queries: 50001, time[ms]: 239 [+- 4.84], ~kQPS: 210 [junit4:junit4] 2> AnalyzingSuggester queries: 50001, time[ms]: 660 [+- 7.39], ~kQPS: 76 [junit4:junit4] OK 39.0s | LookupBenchmarkTest.testPerformanceOnPrefixes2_4 [junit4:junit4] 2> -- construction time [junit4:junit4] 2> JaspellLookup input: 50001, time[ms]: 23 [+- 4.20] [junit4:junit4] 2> TSTLookup input: 50001, time[ms]: 64 [+- 2.06] [junit4:junit4] 2> FSTCompletionLookup input: 50001, time[ms]: 120 [+- 2.11] [junit4:junit4] 2> WFSTCompletionLookup input: 50001, time[ms]: 88 [+- 1.09] [junit4:junit4] 2> AnalyzingSuggester input: 50001, time[ms]: 245 [+- 27.85] [junit4:junit4] OK 10.9s | LookupBenchmarkTest.testConstructionTime [junit4:junit4] 2> -- prefixes: 100-200, num: 7, onlyMorePopular: true [junit4:junit4] 2> JaspellLookup queries: 50001, time[ms]: 68 [+- 1.17], ~kQPS: 731 [junit4:junit4] 2> TSTLookup queries: 50001, time[ms]: 31 [+- 2.82], ~kQPS: 1617 [junit4:junit4] 2> FSTCompletionLookup queries: 50001, time[ms]: 141 [+- 1.97], ~kQPS: 354 [junit4:junit4] 2> WFSTCompletionLookup queries: 50001, time[ms]: 45 [+- 3.37], ~kQPS: 1099 [junit4:junit4] 2> AnalyzingSuggester queries: 50001, time[ms]: 233 [+- 4.02], ~kQPS: 215 [junit4:junit4] OK 11.1s | LookupBenchmarkTest.testPerformanceOnFullHits [junit4:junit4] Completed in 77.54s, 5 tests I added 2nd param (maxQueueDepth) to TopNSearcher, and fixed WFSTSuggester to pass topN for that (should get back most of its perf). I also fixed AnalyzingSuggester: we can bound how big a queue we need by the worst case number of analyzed forms for a single surface form. This is nice because if the analyzed doesn't create a graph then we should have close to same perf as before.
        Hide
        Michael McCandless added a comment -

        Though this is a pretty big hit (i look at prefixes 2-4), lets commit the fix
        for the bug first and then go back around to optimizations.

        I agree – I'll make two commits here: first fixing the bugs, then adding some optos back. The optos don't fully recover the perf loss but they get much of it back ...

        Show
        Michael McCandless added a comment - Though this is a pretty big hit (i look at prefixes 2-4), lets commit the fix for the bug first and then go back around to optimizations. I agree – I'll make two commits here: first fixing the bugs, then adding some optos back. The optos don't fully recover the perf loss but they get much of it back ...
        Hide
        Dawid Weiss added a comment -

        I've been following the discussion but I don't get the original problem – could you provide an example of a token sequence (graph) for which the problem occurs? Is it in the patch somewhere (I looked at it but missed it somehow)?

        Show
        Dawid Weiss added a comment - I've been following the discussion but I don't get the original problem – could you provide an example of a token sequence (graph) for which the problem occurs? Is it in the patch somewhere (I looked at it but missed it somehow)?
        Hide
        Robert Muir added a comment -

        I think the general issue is the fact that one suggestion is stored in the fst multiple times if there is a graph at index time.

        So imagine a synonyms of dog=puppy, and a suggestion of hairy dog is stored twice in the fst, hairy dog->hairy dog but also hairy puppy->hairy dog.

        We don't want to return duplicate entries but also want a correct top n of unique suggestions

        Show
        Robert Muir added a comment - I think the general issue is the fact that one suggestion is stored in the fst multiple times if there is a graph at index time. So imagine a synonyms of dog=puppy, and a suggestion of hairy dog is stored twice in the fst, hairy dog->hairy dog but also hairy puppy->hairy dog. We don't want to return duplicate entries but also want a correct top n of unique suggestions
        Hide
        Dawid Weiss added a comment -

        Ok, get it. Bounding the priority queue is crucial to get good performance here so like Mike mentioned – it should probably be something adjustable with a sensible default so that extreme cases require manual tuning but it works fast for the average case. This could be solved by replacing synonym sets with a unique token but the number of such sets can grow exponentially so it's trading one beast for another – not worth the effort I think.

        Show
        Dawid Weiss added a comment - Ok, get it. Bounding the priority queue is crucial to get good performance here so like Mike mentioned – it should probably be something adjustable with a sensible default so that extreme cases require manual tuning but it works fast for the average case. This could be solved by replacing synonym sets with a unique token but the number of such sets can grow exponentially so it's trading one beast for another – not worth the effort I think.
        Hide
        Michael McCandless added a comment -

        The core problem is that the TopNSearcher assumes that every partial
        path will lead to at least one final path, but with the AnalyzingSuggester
        this is in general false because of token stream graphs.

        So as a first step I've turned off all queue pruning ... this fixes
        the bugs.

        But note that for WFSTCompletionLookup (why is this not WFSTSuggester?
        ) it's still true, so the solution is simple there: keep pruning
        the queue like before.

        For AnalyzingSuggsester my current solution is to bound the queue
        using topN * M where M is the max number of analyzed forms we ever saw
        for any surface form. So this means if you never make graphs we keep
        doing the same aggressive pruning as before.

        Show
        Michael McCandless added a comment - The core problem is that the TopNSearcher assumes that every partial path will lead to at least one final path, but with the AnalyzingSuggester this is in general false because of token stream graphs. So as a first step I've turned off all queue pruning ... this fixes the bugs. But note that for WFSTCompletionLookup (why is this not WFSTSuggester? ) it's still true, so the solution is simple there: keep pruning the queue like before. For AnalyzingSuggsester my current solution is to bound the queue using topN * M where M is the max number of analyzed forms we ever saw for any surface form. So this means if you never make graphs we keep doing the same aggressive pruning as before.
        Hide
        Michael McCandless added a comment -

        OK this is the optimizations patch ... I added maxQueueDepth to TopNSearcher and passed appropriate values from WFSTCompletionLookup and AnalyzingSuggester.

        I also fixed AnalyzingSuggester store/load to save the maxAnalyzedPathsForOneInput ... but in general these methods make me nervous: how come they don't take Directory or DataInput/Output like our other store APIs, they don't seem well tested, they don't write a header (I know we don't promise back-compat here but it'd at least be nice to get a clear exc)...

        Show
        Michael McCandless added a comment - OK this is the optimizations patch ... I added maxQueueDepth to TopNSearcher and passed appropriate values from WFSTCompletionLookup and AnalyzingSuggester. I also fixed AnalyzingSuggester store/load to save the maxAnalyzedPathsForOneInput ... but in general these methods make me nervous: how come they don't take Directory or DataInput/Output like our other store APIs, they don't seem well tested, they don't write a header (I know we don't promise back-compat here but it'd at least be nice to get a clear exc)...
        Hide
        Michael McCandless added a comment -

        Another possible fix for AnalyzingSuggester would be to "guess" an appropriate maxQueueDepth, run the search, and if the pruning becomes inadmissible (you can easily detect this by counting how many dup paths were actually rejected), then re-run the search with a larger guess, and iterate until you succeed.

        For syn-heavy (or otherwise graph-heavy) analyzers this could be a win over the current patch.

        Though if the analyzer is doing so much expansion presumably the app would have set the limit on max expansions which would then make the current patch fast(er) again.

        But I think we should separately explore that ...

        Show
        Michael McCandless added a comment - Another possible fix for AnalyzingSuggester would be to "guess" an appropriate maxQueueDepth, run the search, and if the pruning becomes inadmissible (you can easily detect this by counting how many dup paths were actually rejected), then re-run the search with a larger guess, and iterate until you succeed. For syn-heavy (or otherwise graph-heavy) analyzers this could be a win over the current patch. Though if the analyzer is doing so much expansion presumably the app would have set the limit on max expansions which would then make the current patch fast(er) again. But I think we should separately explore that ...
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1400635

        LUCENE-4481: add back some optos

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1400635 LUCENE-4481 : add back some optos
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1400417

        LUCENE-4481: turn off queue pruning (it's not in general admissible)

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1400417 LUCENE-4481 : turn off queue pruning (it's not in general admissible)

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development