Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-4682

Reduce wasted bytes in FST due to array arcs

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/FSTs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup.

      The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes.

      I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted.

      It would be nice to reduce this.

      One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion.

      Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ...

      Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ...

        Attachments

        1. fstByteStats.txt
          16 kB
          Michael McCandless
        2. kuromoji.wasted.bytes.txt
          256 kB
          Michael McCandless
        3. LUCENE-4682.patch
          5 kB
          Robert Muir

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                mikemccand Michael McCandless
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated: