Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
Description
Large underestimations of NDV values may occur after a join operation since the current logic will decrease the original NDV values proportionally.
The code compares the number of rows of each relation before the join with the number of rows after the join and extracts a ratio for each side. Based on this ratio it adapts (reduces) the NDV accordingly.
Consider for instance the following query:
select inv_warehouse_sk , inv_item_sk , stddev_samp(inv_quantity_on_hand) stdev , avg(inv_quantity_on_hand) mean from inventory , date_dim where inv_date_sk = d_date_sk and d_year = 1999 and d_moy = 2 group by inv_warehouse_sk, inv_item_sk;
For the sake of the discussion, I outline below some relevant stats (from TPCDS30tb):
T(inventory) = 1627857000
T(date_dim) = 73049
T(inventory JOIN date_dim[d_year=1999 AND d_moy=2]) = 24948000
V(inventory, inv_date_sk) = 261
V(inventory, inv_item_sk) = 420000
V(inventory, inv_warehouse_sk) = 27
V(date_dim, inv, d_date_sk) = 73049
For instance, in this query the join between inventory and date_dim has ~24M rows while inventory has ~1.5B so the NDV of the columns coming from inventory are reduced by a factor of ~100 so we end up with V(JOIN, inv_item_sk) = ~6K while the real one is 231000.
Attachments
Issue Links
- links to