Lucene - Core
LUCENE-3289

FST should allow controlling how hard builder tries to share suffixes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Today we have a boolean option to the FST builder telling it whether
      it should share suffixes.

      If you turn this off, building is much faster, uses much less RAM, and
      the resulting FST is a prefix trie. But the FST is larger than it
      needs to be. When it's on, the builder maintains a node hash holding
      every node seen so far in the FST – this uses up RAM and slows things
      down.

      On a dataset that Elmer (see the java-user thread "Autocompletion on
      large index" from Jul 6 2011) provided (thank you!), which has 1.32M
      titles averaging 67.3 chars per title, building with suffix sharing on
      took 22.5 seconds, required 1.25 GB heap, and produced a 91.6 MB FST.
      With suffix sharing off, it took 8.2 seconds and 450 MB heap, and
      produced a 129 MB FST.

      I think we should allow this boolean to be a shade of gray instead:
      how well suffixes can be shared is usually a function of how far they
      are from the end of the string, so by adding a tunable N that shares a
      suffix only when its length is < N, we can let the caller make
      reasonable tradeoffs.
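
      A minimal sketch of the proposed behavior (illustrative only; the real
      logic lives in org.apache.lucene.util.fst.Builder and its node hash,
      and the names below are made up): the builder would consult the
      node-dedup hash only for nodes within N positions of the end of the
      input, and write all other nodes without trying to share them.

      import java.util.HashMap;
      import java.util.Map;

      // Toy model of the proposed tunable, NOT Lucene's actual Builder.
      final class TailSharingSketch {

        // dedup map: serialized node contents -> address of the frozen node
        private final Map<String, Long> nodeHash = new HashMap<String, Long>();
        private final int shareMaxTailLength; // the proposed N
        private long nextAddress = 0;

        TailSharingSketch(int shareMaxTailLength) {
          this.shareMaxTailLength = shareMaxTailLength;
        }

        // "Freeze" one tail node and return its address. tailLength is the
        // node's distance from the end of the input (1 = last node); nodeKey
        // stands in for the node's full contents (arcs, labels, outputs).
        long freeze(String nodeKey, int tailLength) {
          if (tailLength <= shareMaxTailLength) {
            // Near the end of the string: sharing is likely, so pay the hash
            // cost and reuse an identical, already-frozen node if one exists.
            Long shared = nodeHash.get(nodeKey);
            if (shared != null) {
              return shared.longValue();
            }
            long address = nextAddress++;
            nodeHash.put(nodeKey, address);
            return address;
          }
          // Far from the end: sharing rarely pays off, so skip the hash
          // entirely, saving RAM and CPU at the cost of a larger FST.
          return nextAddress++;
        }
      }

      In this sketch, N = 0 reproduces today's "no sharing" behavior and a
      very large N reproduces today's "always share" behavior; everything in
      between trades FST size for build time and heap.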

      Attachments

      1. LUCENE-3289.patch
        17 kB
        Michael McCandless
      2. LUCENE-3289.patch
        3 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Initial rough patch showing the idea.

        Michael McCandless added a comment -

        NOTE: patch applies to 3.x.

        I ran the patch on the titles, varying the max suffix-sharing length:

        Len  FST size (bytes)  Seconds
        1 135446807 8.2
        2 137632702 8.5
        3 135177994 8.3
        4 132782016 8.3
        5 130415331 8.4
        6 128086200 8.0
        7 125797396 8.2
        8 123552157 8.5
        9 121358375 8.4
        10 119228942 8.1
        11 117181180 8.8
        12 115229788 8.7
        13 113388260 9.5
        14 111664442 9.0
        15 110059167 9.2
        16 108572519 9.7
        17 107201905 9.8
        18 105942576 10.3
        19 104791497 10.1
        20 103745678 11.1
        21 102801693 10.8
        22 101957797 11.4
        23 101206564 11.1
        24 100541849 11.0
        25 99956443 11.1
        26 99443232 12.9
        27 98995194 13.2
        28 98604680 13.9
        29 98264184 13.5
        30 97969241 13.6
        31 97714049 13.8
        32 97494104 14.3
        33 97304045 14.0
        34 97140033 14.3
        35 96998942 14.6
        36 96877590 16.5
        37 96773039 16.9
        38 96682961 16.6
        39 96605160 17.8
        40 96537687 18.3
        41 96479286 17.8
        42 96428710 17.5
        43 96384659 18.9
        44 96346174 17.0
        45 96312826 19.3
        46 96283545 17.8
        47 96257708 19.4
        48 96235159 19.0
        49 96215220 18.7
        50 96197450 19.6
        51 96181539 17.3
        52 96167235 16.9
        53 96154490 17.7
        54 96143081 18.8
        55 96132905 17.4
        56 96123776 17.5
        57 96115462 20.7
        58 96108051 19.2
        59 96101249 19.1
        60 96095107 18.7
        ALL 96020343 22.5

        Very very odd that FST size first goes up at N=2... not yet sure why. But from this curve it looks like there is a sweet spot around maybe N=24. I didn't measure required heap here, but it also will go down as N goes down.
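
        One way a caller might turn measurements like these into a setting
        (a hypothetical helper, not part of the patch): pick the smallest N
        whose FST stays within a chosen size budget of the fully minimized
        ("ALL") result. With a 5% budget this lands on N=24, matching the
        sweet spot above.

        // Standalone sketch; the few data points are copied from the table above.
        public class PickTailLengthSketch {
          public static void main(String[] args) {
            int[] lengths = { 1, 10, 20, 24, 30, 40 };
            long[] sizes = { 135446807L, 119228942L, 103745678L,
                             100541849L, 97969241L, 96537687L };
            long fullyMinimized = 96020343L;   // the "ALL" row
            double maxOverhead = 0.05;         // accept an FST up to 5% larger

            for (int i = 0; i < lengths.length; i++) {
              double overhead = (sizes[i] - fullyMinimized) / (double) fullyMinimized;
              if (overhead <= maxOverhead) {
                System.out.println("pick N=" + lengths[i] + " ("
                    + String.format("%.1f%%", overhead * 100) + " larger than minimal)");
                return;
              }
            }
            System.out.println("no N within budget; use unlimited sharing");
          }
        }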

        Michael McCandless added a comment -

        Patch attached; I think it's ready to commit!

        Separately, we should think about how the suggest module should set these... I left it at "costly but perfect minimization".

        Robert Muir added a comment -

        I think that's probably good for most cases?

        In the example you gave, it seems that an FST might not be the best algorithm? The strings are extremely long (more like short documents) and probably need to be "compressed" in some different data structure, e.g. a word-based one?

        Michael McCandless added a comment -

        Yeah I think "costly but perfect minimization" is the right default.

        Dawid Weiss added a comment -

        Exactly. This is a very specific use case (long suggestions).

        Eks Dev added a comment -

        > The strings are extremely long (more like short documents) and probably need to be "compressed" in some different data structure, e.g. a word-based one?

        That would indeed be cool, e.g. an FST with words (ngrams?) as symbols. Ages ago we used one trie for all unique terms, to get prefix/edit distance on words, and one word-trie (whose symbols were words, via a symbol table) for "documents". I am sure this would cut memory requirements significantly for multiword cases compared to a char-level FST.
        E.g. a TermDictionary that supports ord() could be used as a symbol table.
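
        A rough sketch of that word-as-symbol idea (hypothetical symbol table
        built with a plain HashMap; a real one could come from the term
        dictionary's ord(), as suggested above). It assumes the simple
        two-argument Builder constructor from later Lucene versions; the 3.x
        constructor takes more arguments.

        import java.util.HashMap;
        import java.util.Map;

        import org.apache.lucene.util.IntsRef;
        import org.apache.lucene.util.fst.Builder;
        import org.apache.lucene.util.fst.FST;
        import org.apache.lucene.util.fst.NoOutputs;

        public class WordSymbolFstSketch {
          public static void main(String[] args) throws Exception {
            // Hypothetical symbol table: word -> dense int id.
            Map<String, Integer> symbols = new HashMap<String, Integer>();

            // BYTE4 input labels give a full int per arc, i.e. one word per
            // arc instead of one character per arc.
            Builder<Object> builder =
                new Builder<Object>(FST.INPUT_TYPE.BYTE4, NoOutputs.getSingleton());

            // NOTE: Builder.add() requires inputs in sorted order; a real
            // implementation would sort the encoded int sequences first.
            String[] titles = { "autocompletion on large index",
                                "autocompletion on small index" };
            for (String title : titles) {
              String[] words = title.split(" ");
              int[] ids = new int[words.length];
              for (int i = 0; i < words.length; i++) {
                Integer id = symbols.get(words[i]);
                if (id == null) {
                  id = symbols.size();
                  symbols.put(words[i], id);
                }
                ids[i] = id;
              }
              builder.add(new IntsRef(ids, 0, ids.length),
                          NoOutputs.getSingleton().getNoOutput());
            }

            FST<Object> fst = builder.finish();
            System.out.println("word-level FST built over " + symbols.size() + " symbols");
          }
        }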


          People

          • Assignee:
            Michael McCandless
          • Reporter:
            Michael McCandless
          • Votes:
            1
          • Watchers:
            1
