Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-4132

Estimate the number of distinct values more accurately

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.25.0
    • Component/s: core

      Description

      Currently, we estimate the NDV of many operators based on the RelMdUtil#numDistinctVals method. This method estimates the expected number of distinct values selected n times (with replacement) from a collection with N distinct values. The estimation is based on the approximation when N approaches infinity.

      However, when N is not a large number, the difference between the approximate and exact values can be notabe. In addtion, the error can be magnified by different combinations of N and n, which can lead the optimizer to make wrong decisions.

      For example, when we select one element from a table with a 4-value enum, we expect to get one distinct value according to common sense. However, the current implementation gives 0.88, which is counter-intuitive, and leads to a 10+% error.

      Therefore, we give the exact estimation based on the unbiased estimator (The proof is given in the comment).

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                fan_li_ya Liya Fan
                Reporter:
                fan_li_ya Liya Fan
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 10m
                  1h 10m