Lucene - Core
LUCENE-3289

FST should allow controlling how hard builder tries to share suffixes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Today we have a boolean option to the FST builder telling it whether
      it should share suffixes.

      If you turn this off, building is much faster, uses much less RAM, and
      the resulting FST is a prefix trie. But the FST is larger than it
      needs to be. When it's on, the builder maintains a node hash holding
      every node seen so far in the FST – this uses up RAM and slows things
      down.

      On a dataset that Elmer (see the java-user thread "Autocompletion on
      large index" from Jul 6 2011) provided (thank you!), which has 1.32M
      titles averaging 67.3 chars per title, building with suffix sharing on
      took 22.5 seconds, required 1.25 GB heap, and produced a 91.6 MB FST.
      With suffix sharing off, it took 8.2 seconds and 450 MB heap, and
      produced a 129 MB FST.

      I think we should allow this boolean to be a shade of gray instead:
      how well suffixes can be shared is usually a function of how far they
      are from the end of the string, so by adding a tunable N that shares a
      suffix only when its length is < N, we can let the caller make
      reasonable tradeoffs.
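
      A minimal sketch of the proposed behavior (illustrative only; the real
      logic lives in org.apache.lucene.util.fst.Builder and its node hash,
      and the names below are made up): the builder would consult the
      node-dedup hash only for nodes within N positions of the end of the
      input, and write all other nodes without trying to share them.

      import java.util.HashMap;
      import java.util.Map;

      // Toy model of the proposed tunable, NOT Lucene's actual Builder.
      final class TailSharingSketch {

        // dedup map: serialized node contents -> address of the frozen node
        private final Map<String, Long> nodeHash = new HashMap<String, Long>();
        private final int shareMaxTailLength; // the proposed N
        private long nextAddress = 0;

        TailSharingSketch(int shareMaxTailLength) {
          this.shareMaxTailLength = shareMaxTailLength;
        }

        // "Freeze" one tail node and return its address. tailLength is the
        // node's distance from the end of the input (1 = last node); nodeKey
        // stands in for the node's full contents (arcs, labels, outputs).
        long freeze(String nodeKey, int tailLength) {
          if (tailLength <= shareMaxTailLength) {
            // Near the end of the string: sharing is likely, so pay the hash
            // cost and reuse an identical, already-frozen node if one exists.
            Long shared = nodeHash.get(nodeKey);
            if (shared != null) {
              return shared.longValue();
            }
            long address = nextAddress++;
            nodeHash.put(nodeKey, address);
            return address;
          }
          // Far from the end: sharing rarely pays off, so skip the hash
          // entirely, saving RAM and CPU at the cost of a larger FST.
          return nextAddress++;
        }
      }

      In this sketch, N = 0 reproduces today's "no sharing" behavior and a
      very large N reproduces today's "always share" behavior; everything in
      between trades FST size for build time and heap.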

      Attachments

      1. LUCENE-3289.patch
        17 kB
        Michael McCandless
      2. LUCENE-3289.patch
        3 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Initial rough patch showing the idea.

        Michael McCandless added a comment -

        NOTE: patch applies to 3.x.

        I ran the patch on the titles, varying the max suffix-sharing length:

        Len  FST size (bytes)  Seconds
        1 135446807 8.2
        2 137632702 8.5
        3 135177994 8.3
        4 132782016 8.3
        5 130415331 8.4
        6 128086200 8.0
        7 125797396 8.2
        8 123552157 8.5
        9 121358375 8.4
        10 119228942 8.1
        11 117181180 8.8
        12 115229788 8.7
        13 113388260 9.5
        14 111664442 9.0
        15 110059167 9.2
        16 108572519 9.7
        17 107201905 9.8
        18 105942576 10.3
        19 104791497 10.1
        20 103745678 11.1
        21 102801693 10.8
        22 101957797 11.4
        23 101206564 11.1
        24 100541849 11.0
        25 99956443 11.1
        26 99443232 12.9
        27 98995194 13.2
        28 98604680 13.9
        29 98264184 13.5
        30 97969241 13.6
        31 97714049 13.8
        32 97494104 14.3
        33 97304045 14.0
        34 97140033 14.3
        35 96998942 14.6
        36 96877590 16.5
        37 96773039 16.9
        38 96682961 16.6
        39 96605160 17.8
        40 96537687 18.3
        41 96479286 17.8
        42 96428710 17.5
        43 96384659 18.9
        44 96346174 17.0
        45 96312826 19.3
        46 96283545 17.8
        47 96257708 19.4
        48 96235159 19.0
        49 96215220 18.7
        50 96197450 19.6
        51 96181539 17.3
        52 96167235 16.9
        53 96154490 17.7
        54 96143081 18.8
        55 96132905 17.4
        56 96123776 17.5
        57 96115462 20.7
        58 96108051 19.2
        59 96101249 19.1
        60 96095107 18.7
        ALL 96020343 22.5

        Very very odd that FST size first goes up at N=2... not yet sure why. But from this curve it looks like there is a sweet spot around maybe N=24. I didn't measure required heap here, but it also will go down as N goes down.
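
        One way a caller might turn measurements like these into a setting
        (a hypothetical helper, not part of the patch): pick the smallest N
        whose FST stays within a chosen size budget of the fully minimized
        ("ALL") result. With a 5% budget this lands on N=24, matching the
        sweet spot above.

        // Standalone sketch; the few data points are copied from the table above.
        public class PickTailLengthSketch {
          public static void main(String[] args) {
            int[] lengths = { 1, 10, 20, 24, 30, 40 };
            long[] sizes = { 135446807L, 119228942L, 103745678L,
                             100541849L, 97969241L, 96537687L };
            long fullyMinimized = 96020343L;   // the "ALL" row
            double maxOverhead = 0.05;         // accept an FST up to 5% larger

            for (int i = 0; i < lengths.length; i++) {
              double overhead = (sizes[i] - fullyMinimized) / (double) fullyMinimized;
              if (overhead <= maxOverhead) {
                System.out.println("pick N=" + lengths[i] + " ("
                    + String.format("%.1f%%", overhead * 100) + " larger than minimal)");
                return;
              }
            }
            System.out.println("no N within budget; use unlimited sharing");
          }
        }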

        Michael McCandless added a comment -

        Patch attached; I think it's ready to commit!

        Separately, we should think about how the suggest module should set these... I left it at "costly but perfect minimization".

        Robert Muir added a comment -

        I think that's probably good for most cases?

        In the example you gave, it seems that an FST might not be the best algorithm? The strings are extremely long (more like short documents) and probably need to be "compressed" in some different data structure, e.g. a word-based one?

        Michael McCandless added a comment -

        Yeah I think "costly but perfect minimization" is the right default.

        Dawid Weiss added a comment -

        Exactly. This is a very specific use case (long suggestions).

        Eks Dev added a comment -

        > The strings are extremely long (more like short documents) and probably need to be "compressed" in some different data structure, e.g. a word-based one?

        That would indeed be cool, e.g. an FST with words (ngrams?) as symbols. Ages ago we used one trie for all unique terms, to get prefix/edit distance on words, and one word-trie (whose symbols were words, via a symbol table) for "documents". I am sure this would cut memory requirements significantly for multiword cases compared to a char-level FST.
        E.g. a TermDictionary that supports ord() could be used as a symbol table.
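
        A rough sketch of that word-as-symbol idea (hypothetical symbol table
        built with a plain HashMap; a real one could come from the term
        dictionary's ord(), as suggested above). It assumes the simple
        two-argument Builder constructor from later Lucene versions; the 3.x
        constructor takes more arguments.

        import java.util.HashMap;
        import java.util.Map;

        import org.apache.lucene.util.IntsRef;
        import org.apache.lucene.util.fst.Builder;
        import org.apache.lucene.util.fst.FST;
        import org.apache.lucene.util.fst.NoOutputs;

        public class WordSymbolFstSketch {
          public static void main(String[] args) throws Exception {
            // Hypothetical symbol table: word -> dense int id.
            Map<String, Integer> symbols = new HashMap<String, Integer>();

            // BYTE4 input labels give a full int per arc, i.e. one word per
            // arc instead of one character per arc.
            Builder<Object> builder =
                new Builder<Object>(FST.INPUT_TYPE.BYTE4, NoOutputs.getSingleton());

            // NOTE: Builder.add() requires inputs in sorted order; a real
            // implementation would sort the encoded int sequences first.
            String[] titles = { "autocompletion on large index",
                                "autocompletion on small index" };
            for (String title : titles) {
              String[] words = title.split(" ");
              int[] ids = new int[words.length];
              for (int i = 0; i < words.length; i++) {
                Integer id = symbols.get(words[i]);
                if (id == null) {
                  id = symbols.size();
                  symbols.put(words[i], id);
                }
                ids[i] = id;
              }
              builder.add(new IntsRef(ids, 0, ids.length),
                          NoOutputs.getSingleton().getNoOutput());
            }

            FST<Object> fst = builder.finish();
            System.out.println("word-level FST built over " + symbols.size() + " symbols");
          }
        }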


          People

          • Assignee:
            Michael McCandless
          • Reporter:
            Michael McCandless
          • Votes:
            1
          • Watchers:
            1
