Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-8084

FSTs can be very space-inefficient on array-expanded nodes

Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • None
    • None
    • None
    • None
    • New

    Description

      We have FSTs which operate on a larger alphabet (keys in int) space and emit character sequence outputs. I noticed that certain nodes get expanded into fixed-size arrays to accelerate lookups (binary search). This has a potential problem, however, when the outputs emit larger blobs of data (say, one of the outputs is very long, all the others are small). Then the fixed-size array is very much overallocated, as evident on the attached picture.

      I wonder if it'd be better to encode the array as fixed-size, but without the outputs and use a local fixed-size pointer into a "value pool" somewhere next to the node's arcs. Theoretically this "value pool" could even be shared by all of automaton's nodes (saved once at the end or flushed periodically).

      Just a thought.

      Attachments

        1. capture-4.png
          167 kB
          Dawid Weiss

        Issue Links

          Activity

            People

              Unassigned Unassigned
              dweiss Dawid Weiss
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated: