[LUCENE-3297] FST doesn't fully share common prefix across all outputs - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: core/FSTs
Labels:
None

Lucene Fields:

New

Description

FST will try to share prefixes of outputs when possible, however in the [I think unusual in practice] case where all outputs share a common prefix, FST really ought to store this just once, on the root arc, but instead it's only able to push back to the N root arcs. It's sort of an off-by-one on how far back the pushing goes...

One [synthetic] example where this makes a big difference is the new Test2BPostings test, when it uses MemoryCodec, because this test has 26 terms (letters of alphabet) and each term has exactly the same long (~85 MB) all 1s byte[] as the postings. If we fixed this issue, then the resulting FST would only be ~85 MB but now instead it needs to be ~85 * 26 MB.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Michael McCandless

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 09/Jul/11 13:01

Updated:: 28/Aug/22 12:52