Lucene - Core
  1. Lucene - Core
  2. LUCENE-3289

FST should allow controlling how hard builder tries to share suffixes

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today we have a boolean option to the FST builder telling it whether
      it should share suffixes.

      If you turn this off, building is much faster, uses much less RAM, and
      the resulting FST is a prefix trie. But, the FST is larger than it
      needs to be. When it's on, the builder maintains a node hash holding
      every node seen so far in the FST – this uses up RAM and slows things
      down.

      On a dataset that Elmer (see java-user thread "Autocompletion on large
      index" on Jul 6 2011) provided (thank you!), which is 1.32 M titles
      avg 67.3 chars per title, building with suffix sharing on took 22.5
      seconds, required 1.25 GB heap, and produced 91.6 MB FST. With suffix
      sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.

      I think we should allow this boolean to be shade-of-gray instead:
      usually, how well suffixes can share is a function of how far they are
      from the end of the string, so, by adding a tunable N to only share
      when suffix length < N, we can let caller make reasonable tradeoffs.

      1. LUCENE-3289.patch
        17 kB
        Michael McCandless
      2. LUCENE-3289.patch
        3 kB
        Michael McCandless

        Activity

        Michael McCandless created issue -
        Michael McCandless made changes -
        Field Original Value New Value
        Attachment LUCENE-3289.patch [ 12485566 ]
        Michael McCandless made changes -
        Attachment LUCENE-3289.patch [ 12485623 ]
        Michael McCandless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development