Currently, we estimate the NDV of many operators based on the RelMdUtil#numDistinctVals method. This method estimates the expected number of distinct values selected n times (with replacement) from a collection with N distinct values. The estimation is based on the approximation when N approaches infinity.
However, when N is not a large number, the difference between the approximate and exact values can be notabe. In addtion, the error can be magnified by different combinations of N and n, which can lead the optimizer to make wrong decisions.
For example, when we select one element from a table with a 4-value enum, we expect to get one distinct value according to common sense. However, the current implementation gives 0.88, which is counter-intuitive, and leads to a 10+% error.
Therefore, we give the exact estimation based on the unbiased estimator (The proof is given in the comment).