[CALCITE-4132] Estimate the number of distinct values more accurately - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.25.0
Component/s: core
Labels:
- pull-request-available

Description

Currently, we estimate the NDV of many operators based on the RelMdUtil#numDistinctVals method. This method estimates the expected number of distinct values selected n times (with replacement) from a collection with N distinct values. The estimation is based on the approximation when N approaches infinity.

However, when N is not a large number, the difference between the approximate and exact values can be notabe. In addtion, the error can be magnified by different combinations of N and n, which can lead the optimizer to make wrong decisions.

For example, when we select one element from a table with a 4-value enum, we expect to get one distinct value according to common sense. However, the current implementation gives 0.88, which is counter-intuitive, and leads to a 10+% error.

Therefore, we give the exact estimation based on the unbiased estimator (The proof is given in the comment).

Attachments

Issue Links

causes

CALCITE-4351 RelMdUtil#numDistinctVals always returns 0 for large inputs

Closed

links to

GitHub Pull Request #2077

Activity

People

Assignee:: Liya Fan

Reporter:: Liya Fan

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 20/Jul/20 02:03

Updated:: 23/Oct/20 08:01

Resolved:: 29/Jul/20 08:53

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1h 10m