[SPARK-23753] [Performance] Group By Push Down through Join - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: SQL
Labels:
None

Description

Group By push down through Join

Another transformation that benefits from RI constraints is Group By push down through joins. The transformation interchanges the order of the group-by and join operations. The benefit of pushing down a group-by is that it may reduce the number of input rows to the join. On the other hand, if the join is very selective, it might make sense to execute the group by after the join. That is why this transformation is in general applied based on cost or selectivity estimates.

However, if the join is an RI join, under certain conditions, it is safe to push down group by operation below the join. An example is shown below.

select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name, 
       sum(ss.ss_quantity) as store_sales_quantity
from store_sales ss, date_dim, customer, store
where d_date_sk  = ss_sold_date_sk and
      c_customer_sk  = ss_customer_sk and
      s_store_sk  = ss_store_sk and
      d_year between 2000 and 2002
group by c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name

The query computes the quantities sold grouped by customer and store tables. The tables are in a star schema join. The grouping columns are a super set of the join keys. The aggregate columns come from the fact table store_sales. The group by operation can be pushed down to the fact table store_sales through the join with the customer and store tables. The join will not affect the partitions nor the aggregates computed by the pushed down group-by since every tuple in store_sales will join with a tuple in customer and store tables.

select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name,
       v1.store_sales_quantity
from customer, store, (select ss_customer_sk, ss_store_sk, sum(ss_quantity) as store_sales_quantity
                       from store_sales, date_dim
                       where d_date_sk = ss_sold_date_sk and
                             d_year between 2000 and 2002
                       group by ss_customer_sk, ss_store_sk ) v1
where c_customer_sk = v1.ss_customer_sk and
      s_store_sk = v1.ss_store_sk

When the query is run using a 1TB TPC-DS setup, the group by reduces the number of rows from 1.5 billion to 100 million rows and the query execution drops from about 70 secs to 30 secs, a 2x improvement.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Ioana Delaney

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 20/Mar/18 18:43

Updated:: 29/Mar/20 02:37