Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-8920

Reduce size of FSTs due to use of direct-addressing encoding

Details

    • Improvement
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • None
    • 8.4
    • None
    • None
    • New

    Description

      Some data can lead to worst-case ~4x RAM usage due to this optimization. Several ideas were suggested to combat this on the mailing list:

      I think we can improve thesituation here by tracking, per-FST instance, the size increase we're seeing while building (or perhaps do a preliminary pass before building) in order to decide whether to apply the encoding.

      we could also make the encoding a bit more efficient. For instance I noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) which make gaps very costly. Associating each label with a dense id and having an intermediate lookup, ie. lookup label > id and then id>arc offset instead of doing label->arc directly could save a lot of space in some cases? Also it seems that we are repeating the label in the arc metadata when array-with-gaps is used, even though it shouldn't be necessary since the label is implicit from the address?

      Attachments

        1. TestTermsDictRamBytesUsed.java
          4 kB
          Adrien Grand

        Issue Links

          Activity

            People

              Unassigned Unassigned
              sokolov Michael Sokolov
              Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 5h 10m
                  5h 10m