Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-366

Support Aggregate push down in bushy joins

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 1.2.0-incubating
    • None

    Description

      Currently OPTIQ doesn't push down aggregates for bushy joins and this can be a performance limitation in some queries.

      Ideally before we push down the aggregate we should take into the account the NDV and reduction of the join which we are trying to push the aggregate underneath.

      On Hive to measure potential gains I wrote a query which has the following business question based on the TPC-DS schema:
      Identify customers from specific gender with good credit rating that spent more than X amount of money during a specific year.

      This query has 3 main components
      1) Hub containing store_sales & date_dim
      2) Hub containing customer and customer_demographics
      3) Aggregate

      And for that query I have 2 rewrites :
      1) Rewrite that generates a bushy join plan
      2) Rewrite that generates a bushy join plan and pushes the aggregate down

      Results show that aggregate push down in a bushy plan cuts down the query runtime by half.
      The aggregate push down reduces the intermediate data in the query from 384GB to 22GB.

      Base query finishes in 4,781 seconds with 2,253 GB of intermediate data
      Bushy query finished in 1,374 seconds with 384GB of intermediate data
      Bushy query + aggregate push down finished in 605 with only 22GB of intermediate data.

      Base query

      "select 
          c_customer_sk, sum(ss_net_paid) as net_paid
      from
          customer,
          customer_demographics,
          store_sales,
          date_dim
      where
          c_current_cdemo_sk = cd_demo_sk
              and cd_credit_rating = 'Good'
              and cd_gender = 'F'
              and ss_customer_sk = c_customer_sk
              and ss_customer_sk is not null
              and d_year = 2001
              and d_date_sk = ss_sold_date_sk
      group by c_customer_sk
      having net_paid > 0
      order by net_paid desc
      limit 100;"
      

      Bushy query

      with ss as 
      ( 
      select 
          ss_net_paid, ss_customer_sk
      from
          store_sales,date_dim
      where
          d_year = 2001
              and d_date_sk = ss_sold_date_sk
              and ss_customer_sk is not null
      ), cc as 
      (
      select 
          c_customer_sk
      from
          customer,customer_demographics
      where
          c_current_cdemo_sk = cd_demo_sk
              and cd_credit_rating = 'Good'
              and cd_gender = 'F'
      )
      select 
          c_customer_sk, sum(ss_net_paid) as net_paid
      from 
          cc,ss
      where
          ss_customer_sk = c_customer_sk
      group by c_customer_sk
      having net_paid > 0
      order by net_paid desc
      limit 100;
      

      Bushy query + aggregate push down

      
      

      Attachments

        Issue Links

          Activity

            People

              jcamacho Jesús Camacho Rodríguez
              mmokhtar Mostafa Mokhtar
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: