Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-4132

Estimate the number of distinct values more accurately

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 1.25.0
    • core

    Description

      Currently, we estimate the NDV of many operators based on the RelMdUtil#numDistinctVals method. This method estimates the expected number of distinct values selected n times (with replacement) from a collection with N distinct values. The estimation is based on the approximation when N approaches infinity.

      However, when N is not a large number, the difference between the approximate and exact values can be notabe. In addtion, the error can be magnified by different combinations of N and n, which can lead the optimizer to make wrong decisions.

      For example, when we select one element from a table with a 4-value enum, we expect to get one distinct value according to common sense. However, the current implementation gives 0.88, which is counter-intuitive, and leads to a 10+% error.

      Therefore, we give the exact estimation based on the unbiased estimator (The proof is given in the comment).

      Attachments

        Issue Links

          Activity

            People

              fan_li_ya Liya Fan
              fan_li_ya Liya Fan
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 10m
                  1h 10m