Details
-
Improvement
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
None
-
None
-
None
-
None
-
New
Description
We have FSTs which operate on a larger alphabet (keys in int) space and emit character sequence outputs. I noticed that certain nodes get expanded into fixed-size arrays to accelerate lookups (binary search). This has a potential problem, however, when the outputs emit larger blobs of data (say, one of the outputs is very long, all the others are small). Then the fixed-size array is very much overallocated, as evident on the attached picture.
I wonder if it'd be better to encode the array as fixed-size, but without the outputs and use a local fixed-size pointer into a "value pool" somewhere next to the node's arcs. Theoretically this "value pool" could even be shared by all of automaton's nodes (saved once at the end or flushed periodically).
Just a thought.
Attachments
Attachments
Issue Links
- relates to
-
LUCENE-4682 Reduce wasted bytes in FST due to array arcs
- Open