Details

    • Sub-task
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.1.0
    • None
    • SQL
    • None

    Description

      Group By push down through Join

      Another transformation that benefits from RI constraints is Group By push down through joins. The transformation interchanges the order of the group-by and join operations. The benefit of pushing down a group-by is that it may reduce the number of input rows to the join. On the other hand, if the join is very selective, it might make sense to execute the group by after the join. That is why this transformation is in general applied based on cost or selectivity estimates.

      However, if the join is an RI join, under certain conditions, it is safe to push down group by operation below the join. An example is shown below.

      select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name, 
             sum(ss.ss_quantity) as store_sales_quantity
      from store_sales ss, date_dim, customer, store
      where d_date_sk  = ss_sold_date_sk and
            c_customer_sk  = ss_customer_sk and
            s_store_sk  = ss_store_sk and
            d_year between 2000 and 2002
      group by c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name
      

      The query computes the quantities sold grouped by customer and store tables. The tables are in a star schema join. The grouping columns are a super set of the join keys. The aggregate columns come from the fact table store_sales. The group by operation can be pushed down to the fact table store_sales through the join with the customer and store tables. The join will not affect the partitions nor the aggregates computed by the pushed down group-by since every tuple in store_sales will join with a tuple in customer and store tables.

      select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name,
             v1.store_sales_quantity
      from customer, store, (select ss_customer_sk, ss_store_sk, sum(ss_quantity) as store_sales_quantity
                             from store_sales, date_dim
                             where d_date_sk = ss_sold_date_sk and
                                   d_year between 2000 and 2002
                             group by ss_customer_sk, ss_store_sk ) v1
      where c_customer_sk = v1.ss_customer_sk and
            s_store_sk = v1.ss_store_sk
      


      When the query is run using a 1TB TPC-DS setup, the group by reduces the number of rows from 1.5 billion to 100 million rows and the query execution drops from about 70 secs to 30 secs, a 2x improvement.

      Attachments

        Activity

          People

            Unassigned Unassigned
            ioana-delaney Ioana Delaney
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated: