[HIVE-23684] Large underestimation in NDV stats when input and join cardinality ratio is big - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.0.0-alpha-1
Component/s: None
Labels:
- pull-request-available

Description

Large underestimations of NDV values may occur after a join operation since the current logic will decrease the original NDV values proportionally.

The code compares the number of rows of each relation before the join with the number of rows after the join and extracts a ratio for each side. Based on this ratio it adapts (reduces) the NDV accordingly.

Consider for instance the following query:

select inv_warehouse_sk
     , inv_item_sk
     , stddev_samp(inv_quantity_on_hand) stdev
     , avg(inv_quantity_on_hand)         mean
from inventory
   , date_dim
where inv_date_sk = d_date_sk
  and d_year = 1999
  and d_moy = 2
group by inv_warehouse_sk, inv_item_sk;

For the sake of the discussion, I outline below some relevant stats (from TPCDS30tb):
T(inventory) = 1627857000
T(date_dim) = 73049
T(inventory JOIN date_dim[d_year=1999 AND d_moy=2]) = 24948000
V(inventory, inv_date_sk) = 261
V(inventory, inv_item_sk) = 420000
V(inventory, inv_warehouse_sk) = 27
V(date_dim, inv, d_date_sk) = 73049

For instance, in this query the join between inventory and date_dim has ~24M rows while inventory has ~1.5B so the NDV of the columns coming from inventory are reduced by a factor of ~100 so we end up with V(JOIN, inv_item_sk) = ~6K while the real one is 231000.

Attachments

Issue Links

links to

GitHub Pull Request #1786

Activity

People

Assignee:: Vineet Garg

Reporter:: Stamatis Zampetakis

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 12/Jun/20 12:27

Updated:: 17/Nov/22 08:47

Resolved:: 13/Jan/21 16:46

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

50m