Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-23684

Large underestimation in NDV stats when input and join cardinality ratio is big

    XMLWordPrintableJSON

Details

    Description

      Large underestimations of NDV values may occur after a join operation since the current logic will decrease the original NDV values proportionally.

      The code compares the number of rows of each relation before the join with the number of rows after the join and extracts a ratio for each side. Based on this ratio it adapts (reduces) the NDV accordingly.

      Consider for instance the following query:

      select inv_warehouse_sk
           , inv_item_sk
           , stddev_samp(inv_quantity_on_hand) stdev
           , avg(inv_quantity_on_hand)         mean
      from inventory
         , date_dim
      where inv_date_sk = d_date_sk
        and d_year = 1999
        and d_moy = 2
      group by inv_warehouse_sk, inv_item_sk;
      

      For the sake of the discussion, I outline below some relevant stats (from TPCDS30tb):
      T(inventory) = 1627857000
      T(date_dim) = 73049
      T(inventory JOIN date_dim[d_year=1999 AND d_moy=2]) = 24948000
      V(inventory, inv_date_sk) = 261
      V(inventory, inv_item_sk) = 420000
      V(inventory, inv_warehouse_sk) = 27
      V(date_dim, inv, d_date_sk) = 73049

      For instance, in this query the join between inventory and date_dim has ~24M rows while inventory has ~1.5B so the NDV of the columns coming from inventory are reduced by a factor of ~100 so we end up with V(JOIN, inv_item_sk) = ~6K while the real one is 231000.

      Attachments

        Issue Links

          Activity

            People

              vgarg Vineet Garg
              zabetak Stamatis Zampetakis
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 50m
                  50m