Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-19842 Informational Referential Integrity Constraints Support in Spark
  3. SPARK-23750

[Performance] Inner Join Elimination based on Informational RI constraints

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: In Progress
    • Major
    • Resolution: Unresolved
    • 3.1.0
    • None
    • SQL
    • None

    Description

      Inner Join Elimination based on Informational RI constraints

      This transformation detects RI joins and eliminates the parent/PK table if none of its columns, other than the PK columns, are referenced in the query.

      Typical examples that benefit from this rewrite are queries over complex views.

      View using TPC-DS schema:

      create view customer_purchases_2002 (id, last, first, product, store_id, month, quantity) as
      select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, d_moy, ss_quantity 
      from store_sales, date_dim, customer, item, store
      where d_date_sk = ss_sold_date_sk and
            c_customer_sk = ss_customer_sk and 
            i_item_sk = ss_item_sk and
            s_store_sk = ss_store_sk and
            d_year = 2002
      
      

      The view returns customer purchases made in year 2002. It is a join between fact table store_sales and dimensions customer, item, store, and date. The tables are joined using RI predicates.

      If we write a query that only selects a subset of columns from the view, for example, we are only interested in the items bought and not the stores, internally, the Optimizer, will first merge the view into the query, and then, based on the primary key – foreign key join predicate analysis, it will decide that the join with the store table is not needed, and therefore the store table is removed.

      Query:

      select id, first, last, product, quantity 
      from customer_purchases_2002
      where product like ‘bicycle%’ and
            month between 1 and 2
      

      Internal query after view expansion:

      select c_customer_id as id, c_first_name as first, c_last_name as last,
                 i_product_name as product,ss_quantity as quantity 
      from store_sales, date_dim, customer, item, store
      where d_date_sk = ss_sold_date_sk and
            c_customer_sk = ss_customer_sk and 
            i_item_sk = ss_item_sk and
            s_store_sk = ss_store_sk and
            d_year = 2002 and
            month between 1 and 2 and
            product like ‘bicycle%’
      

      Internal optimized query after join elimination:

      select c_customer_id as id, c_first_name as first, c_last_name as last,
                 i_product_name as product,ss_quantity as quantity 
      from store_sales, date_dim, customer, item
      where d_date_sk = ss_sold_date_sk and
            c_customer_sk = ss_customer_sk and 
            i_item_sk = ss_item_sk and
            d_year = 2002 and
            month between 1 and 2 and
            product like ‘bicycle%’
      

      The join with store table can be removed since no columns are retrieved from the table, and every row from the store_sales fact table will find a match in store based on the RI relationship.

      Attachments

        Activity

          People

            Unassigned Unassigned
            ioana-delaney Ioana Delaney
            Votes:
            2 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated: