Details

    • Sub-task
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.1.0
    • None
    • SQL
    • None

    Description

      Existential Subquery to Inner Join

      Another enhancement that uses Informational Constraints is existential subquery to inner join. This rewrite converts an existential subquery to an inner join, and thus provides alternative join choices for the Optimizer based on the selectivity of the tables.

      An example using TPC-DS schema is shown below.

      select c_first_name, c_last_name, c_email_address
      from customer c
      where EXISTS (select * 
                    from store_sales, date_dim
                    where c.c_customer_sk = ss_customer_sk and
                          ss_sold_date_sk = d_date_sk and
                          d_year = 2002 and
                          d_moy between 4 and 4+3)
      

      Spark uses left semi-join to evaluated existential subqueries. A left semi-join will return a row from the outer table if there is at least one match in the inner. Semi-join is a general used technique to rewrite existential subqueries, but it has some limitations as it imposes a certain order of the joined table. In this case the large fact table store_sales has to be on the inner of the join. A more efficient execution can be obtained if the subquery is converted to a regular Inner join. This will allow the Optimizer to choose better join orders.

      Converting a subquery to inner join is possible if either the subquery produces at most one row or, by introducing a Distinct on the outer table’s row key in order to remove the duplicate rows that will result after the inner join and thus to enforce the semantics of the subquery. As a key for the outer, we can use the primary key of the customer table.

      Internal query after rewrite:

      select distinct c_customer_sk /*PK */, c_first_name, c_last_name, c_email_address
      from customer c, store_sales, date_dim
      where c.c_customer_sk = ss_customer_sk and
             ss_sold_date_sk = d_date_sk and
             d_year = 2002 and
             d_moy between 4 and 4+3
      


      Example performance results using 1TB TPC-DS benchmark:


      TPC-DS Query spark-2.2 spark-2.2 w/ sub2join Query speedup
        (secs) (secs)  
      Q10 355 190 2x
      Q16 1394 706 2x
      Q35 462 285 1.5x
      Q69 327 173 1.5x
      Q94 603 307 2x

      Attachments

        Activity

          People

            Unassigned Unassigned
            ioana-delaney Ioana Delaney
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated: