Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17375

Star Join Optimization

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Duplicate
    • None
    • None
    • SQL
    • None

    Description

      The star schema is the simplest style of data mart schema and is the approach
      often seen in BI/Decision Support systems. Star Join is a popular SQL query pattern that joins one or (a few) fact tables with a few dimension tables in star schemas. Star Join Query Optimizations aim to optimize the performance and use of resource for the star joins.

      Currently the existing Spark SQL optimization works on broadcasting the usually small (after filtering and projection) dimension tables to avoid costly shuffling of fact table and the "reduce" operations based on the join keys.

      This improvement proposal tries to further improve the broadcast star joins in the two areas:
      1) avoid materialization of the intermediate rows that otherwise could eventually not make to the final result row set after further joined with other dimensions that are more restricting;
      2) avoid the performance variations among different join orders. This could also have been largely achieved by cost analysis and heuristics and selecting a reasonably optimal join order. But we are here trying to achieve similar improvement without relying on such info.

      A preliminary test against a small TPCDS 1GB data set indicates between 5%-40% improvement (with codegen disabled on both tests) vs. the multiple broadcast joins on one Query (Q27) that inner joins 4 dimension table with one fact table. The large variation (5%-40%) is due to the different join ordering of the 4 broadcast joins. Tests using larger data sets and other TPCDS queries are yet to be performed.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              yzhou2001 Yan
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: