Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-3289

FST should allow controlling how hard builder tries to share suffixes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today we have a boolean option to the FST builder telling it whether
      it should share suffixes.

      If you turn this off, building is much faster, uses much less RAM, and
      the resulting FST is a prefix trie. But, the FST is larger than it
      needs to be. When it's on, the builder maintains a node hash holding
      every node seen so far in the FST – this uses up RAM and slows things
      down.

      On a dataset that Elmer (see java-user thread "Autocompletion on large
      index" on Jul 6 2011) provided (thank you!), which is 1.32 M titles
      avg 67.3 chars per title, building with suffix sharing on took 22.5
      seconds, required 1.25 GB heap, and produced 91.6 MB FST. With suffix
      sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.

      I think we should allow this boolean to be shade-of-gray instead:
      usually, how well suffixes can share is a function of how far they are
      from the end of the string, so, by adding a tunable N to only share
      when suffix length < N, we can let caller make reasonable tradeoffs.

        Attachments

        1. LUCENE-3289.patch
          17 kB
          Michael McCandless
        2. LUCENE-3289.patch
          3 kB
          Michael McCandless

          Activity

            People

            • Assignee:
              mikemccand Michael McCandless
              Reporter:
              mikemccand Michael McCandless
            • Votes:
              1 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: