Spark / SPARK-14032

Eliminate Unnecessary Distinct/Aggregate

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Distinct is an expensive operation, so we should avoid it where possible. When a child operator already guarantees that its output rows are distinct, the Distinct above it is redundant and can be removed.
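
      The idea can be sketched with a toy logical plan. This is a minimal illustration, not Catalyst's actual API: the node classes (`Scan`, `Filter`, `Project`, `Distinct`) and the helpers `output_is_distinct` and `remove_redundant_distinct` are hypothetical names invented for this example, and the distinctness check is deliberately conservative.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    """Leaf node: reads a table whose rows may repeat."""
    table: str


@dataclass(frozen=True)
class Filter:
    """Keeps a subset of the child's rows; never creates duplicates."""
    child: object
    condition: str


@dataclass(frozen=True)
class Project:
    """Keeps a subset of columns; dropping columns may create duplicates."""
    child: object
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Distinct:
    """Removes duplicate rows from its child."""
    child: object


def output_is_distinct(plan) -> bool:
    """Return True only when the plan's output rows are provably unique."""
    if isinstance(plan, Distinct):
        return True
    if isinstance(plan, Filter):
        # Filtering a duplicate-free input cannot introduce duplicates.
        return output_is_distinct(plan.child)
    # Scan and Project: assume duplicates are possible.
    return False


def remove_redundant_distinct(plan):
    """Rewrite bottom-up, dropping any Distinct whose child is already distinct."""
    if isinstance(plan, Scan):
        return plan
    child = remove_redundant_distinct(plan.child)
    if isinstance(plan, Distinct) and output_is_distinct(child):
        return child
    return replace(plan, child=child)


# Distinct over Filter over Distinct: the outer Distinct is redundant.
redundant = Distinct(Filter(Distinct(Scan("t")), "a > 1"))
optimized = remove_redundant_distinct(redundant)

# Distinct over Project over Distinct: the outer Distinct must stay.
needed = Distinct(Project(Distinct(Scan("t")), ("a",)))
unchanged = remove_redundant_distinct(needed)
```

      In this sketch the outer Distinct is dropped only in the first plan. In the second plan the projection may collapse different rows into equal ones, so the rule leaves it alone.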

      For example, in the following TPC-DS query 38, each child of Intersect is already distinct. After Intersect is converted to Left-semi join + Distinct, the top Distinct is therefore redundant and can be removed.

      select count(*) from (
          select distinct c_last_name, c_first_name, d_date
          from store_sales, date_dim, customer
          where store_sales.ss_sold_date_sk = date_dim.d_date_sk
            and store_sales.ss_customer_sk = customer.c_customer_sk
            and d_month_seq between [DMS] and [DMS] + 11
        intersect
          select distinct c_last_name, c_first_name, d_date
          from catalog_sales, date_dim, customer
          where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
            and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
            and d_month_seq between [DMS] and [DMS] + 11
        intersect
          select distinct c_last_name, c_first_name, d_date
          from web_sales, date_dim, customer
          where web_sales.ws_sold_date_sk = date_dim.d_date_sk
            and web_sales.ws_bill_customer_sk = customer.c_customer_sk
            and d_month_seq between [DMS] and [DMS] + 11
      ) hot_cyst
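
      To see why the top Distinct becomes redundant for a query of this shape, here is a small sketch that models rows as Python tuples. The helper names (`distinct`, `left_semi_join`, `intersect`) and the sample rows are made up for illustration; this is not Spark code, only the set-level reasoning.

```python
def distinct(rows):
    """Drop duplicate rows, keeping first-seen order."""
    return list(dict.fromkeys(rows))


def left_semi_join(left, right):
    """Keep each left row that has at least one equal row on the right."""
    right_set = set(right)
    return [row for row in left if row in right_set]


def intersect(left, right):
    """Intersect expressed as a left-semi join followed by Distinct."""
    return distinct(left_semi_join(left, right))


# Hypothetical (c_last_name, c_first_name, d_date) rows.
smith = ("Smith", "Ann", "2000-01-01")
lee = ("Lee", "Bo", "2000-01-02")
kim = ("Kim", "Cy", "2000-01-03")

store = [smith, smith, lee]
catalog = [smith, kim, smith]

# The query applies "select distinct" to the left input before intersecting.
left = distinct(store)
semi = left_semi_join(left, catalog)

# A left-semi join emits each left row at most once, so a duplicate-free
# left input yields a duplicate-free result and the top Distinct is a no-op.
top_distinct_is_noop = distinct(semi) == semi
same_as_intersect = semi == intersect(store, catalog)
```

      The argument relies only on the left side being distinct: a left-semi join filters left rows and never multiplies them, whatever duplicates the right side contains.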
      
      

    People

      Assignee: Unassigned
      Reporter: Xiao Li (smilegator)