Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-21382

Group by keys reduction optimization - keys are not reduced in query23

    XMLWordPrintableJSON

Details

    Description

      explain cbo with frequent_ss_items as 
       (select substr(i_item_desc,1,30) itemdesc,i_item_sk item_sk,d_date solddate,count(*) cnt
        from store_sales
            ,date_dim 
            ,item
        where ss_sold_date_sk = d_date_sk
          and ss_item_sk = i_item_sk 
          and d_year in (1999,1999+1,1999+2,1999+3)
        group by substr(i_item_desc,1,30),i_item_sk,d_date
        having count(*) >4)
      select  sum(sales)
       from ((select cs_quantity*cs_list_price sales
             from catalog_sales
                 ,date_dim 
             where d_year = 1999 
               and d_moy = 1 
               and cs_sold_date_sk = d_date_sk 
               and cs_item_sk in (select item_sk from frequent_ss_items))) subq limit 100;
      
      HiveSortLimit(fetch=[100])
        HiveProject($f0=[$0])
          HiveAggregate(group=[{}], agg#0=[sum($0)])
            HiveProject(sales=[*(CAST($2):DECIMAL(10, 0), $3)])
              HiveSemiJoin(condition=[=($1, $5)], joinType=[inner])
                HiveJoin(condition=[=($0, $4)], joinType=[inner], algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}])
                  HiveProject(cs_sold_date_sk=[$0], cs_item_sk=[$15], cs_quantity=[$18], cs_list_price=[$20])
                    HiveFilter(condition=[IS NOT NULL($0)])
                      HiveTableScan(table=[[perf_constraints, catalog_sales]], table:alias=[catalog_sales])
                  HiveProject(d_date_sk=[$0])
                    HiveFilter(condition=[AND(=($6, 1999), =($8, 1))])
                      HiveTableScan(table=[[perf_constraints, date_dim]], table:alias=[date_dim])
                HiveProject(i_item_sk=[$1])
                  HiveFilter(condition=[>($3, 4)])
                    HiveProject(substr=[$2], i_item_sk=[$1], d_date=[$0], $f3=[$3])
                      HiveAggregate(group=[{3, 4, 5}], agg#0=[count()])
                        HiveJoin(condition=[=($1, $4)], joinType=[inner], algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}])
                          HiveJoin(condition=[=($0, $2)], joinType=[inner], algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}])
                            HiveProject(ss_sold_date_sk=[$0], ss_item_sk=[$2])
                              HiveFilter(condition=[IS NOT NULL($0)])
                                HiveTableScan(table=[[perf_constraints, store_sales]], table:alias=[store_sales])
                            HiveProject(d_date_sk=[$0], d_date=[$2])
                              HiveFilter(condition=[IN($6, 1999, 2000, 2001, 2002)])
                                HiveTableScan(table=[[perf_constraints, date_dim]], table:alias=[date_dim])
                          HiveProject(i_item_sk=[$0], substr=[substr($4, 1, 30)])
                            HiveTableScan(table=[[perf_constraints, item]], table:alias=[item])
      

      Right side of HiveSemiJoin has an aggregate which could be reduce to have only i_item_sk as group by key since i_item_sk is primary key.

      Attachments

        1. HIVE-21382.1.patch
          13 kB
          Vineet Garg
        2. HIVE-21382.2.patch
          125 kB
          Vineet Garg
        3. HIVE-21382.2.patch
          125 kB
          Vineet Garg
        4. HIVE-21382.3.patch
          144 kB
          Vineet Garg
        5. HIVE-21382.4.patch
          143 kB
          Vineet Garg
        6. HIVE-21382.5.patch
          143 kB
          Vineet Garg
        7. HIVE-21382.6.patch
          143 kB
          Vineet Garg

        Issue Links

          Activity

            People

              vgarg Vineet Garg
              vgarg Vineet Garg
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 3h 10m
                  3h 10m