In the current implementation FST states are expanded into a binary search on labels (from a vector of transitions) when the child count of a state exceeds a given predefined threshold (NUM_ARCS_FIXED_ARRAY). This threshold affects automaton size and traversal speed (as it turns out when benchmarked). For some degenerate data sets, close-to-the-root nodes could have a small number of children (below the threshold) and yet be traversed on every single seek.
A fix of this is to introduce two control thresholds:
EXPAND state if (distance-from-root <= MIN_DISTANCE || children-count >= NUM_ARCS_FIXED_ARRAY)
My plan is to create a data set that will prove this first and then to implement the workaround above.
|Field||Original Value||New Value|
|Assignee||Dawid Weiss [ dweiss ]|
|Attachment||LUCENE-2933.patch [ 12471560 ]|
|Status||Open [ 1 ]||Resolved [ 5 ]|
|Resolution||Fixed [ 1 ]|
|Status||Resolved [ 5 ]||Closed [ 6 ]|
|Transition||Time In Source Status||Execution Times||Last Executer||Last Execution Date|
|2d 1h 27m||1||Dawid Weiss||23/Feb/11 10:17|
|807d 26m||1||Uwe Schindler||10/May/13 10:44|