Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-4682

Reduce wasted bytes in FST due to array arcs

Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • None
    • None
    • core/FSTs
    • None
    • New

    Description

      When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup.

      The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes.

      I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted.

      It would be nice to reduce this.

      One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion.

      Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ...

      Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ...

      Attachments

        1. fstByteStats.txt
          16 kB
          Michael McCandless
        2. kuromoji.wasted.bytes.txt
          256 kB
          Michael McCandless
        3. LUCENE-4682.patch
          5 kB
          Robert Muir

        Issue Links

          Activity

            People

              Unassigned Unassigned
              mikemccand Michael McCandless
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated: