Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-8868

New storing strategy for BKD tree leaves with low cardinality

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: main (9.0), 8.2
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently if a leaf on the BKD tree contains only few values, then the leaf is treated the same way as it all values are different. It many cases it can be much more efficient to store the distinct values with the cardinality.

      The strategy is the following:

      1. When writing a leaf block the cardinality is computed.
      2. Perform some naive calculation to compute if it is better to store the leaf as a low cardinality leaf. The storage cost are calculated as follows:

      • low cardinality: leafCardinality * (packedBytesLength - prefixLenSum + 2) where two is the estimated size of storing the cardinality. This is an overestimation as in some cases you will only need one byte to store the cardinality.
      • High cardinality: count * (packedBytesLength - prefixLenSum). We are not taking into account the runlen compression.

      3. If the tree has low cardinality then we set the compressed dim to -2. Note that -1 is when all values are equal.

        Attachments

          Activity

            People

            • Assignee:
              ivera Ignacio Vera
              Reporter:
              ivera Ignacio Vera
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 3h 50m
                3h 50m