Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17626

TPC-DS performance improvements using star-schema heuristics

    XMLWordPrintableJSON

Details

    • Umbrella
    • Status: Resolved
    • Critical
    • Resolution: Done
    • 2.1.0
    • 2.2.0
    • SQL
    • None

    Description

      TPC-DS performance improvements using star-schema heuristics

      TPC-DS consists of multiple snowflake schema, which are multiple star schema with dimensions linking to dimensions. A star schema consists of a fact table referencing a number of dimension tables. Fact table holds the main data about a business. Dimension table, a usually smaller table, describes data reflecting the dimension/attribute of a business.

      As part of the benchmark performance investigation, we observed a pattern of sub-optimal execution plans of large fact tables joins. Manual rewrite of some of the queries into selective fact-dimensions joins resulted in significant performance improvement. This prompted us to develop a simple join reordering algorithm based on star schema detection. The performance testing using 1TB TPC-DS workload shows an overall improvement of 19%.

      Summary of the results:

      Passed                 99
      Failed                  0
      Total q time (s)   14,962
      Max time            1,467
      Min time                3
      Mean time             145
      Geomean                44
      

      Compared to baseline (Negative = improvement; Positive = Degradation):

      End to end improved (%)              -19%	
      Mean time improved (%)               -19%
      Geomean improved (%)                 -24%
      End to end improved (seconds)      -3,603
      Number of queries improved (>10%)      45
      Number of queries degraded (>10%)       6
      Number of queries unchanged            48
      Top 10 queries improved (%)          -20%
      

      Cluster: 20-node cluster with each node having:

      • 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
      • Total memory for the cluster: 2.5TB
      • Total storage: 400TB
      • Total CPU cores: 480

      Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA

      Database info:

      • Schema: TPCDS
      • Scale factor: 1TB total space
      • Storage format: Parquet with Snappy compression

      Our investigation and results are included in the attached document.

      There are two parts to this improvement:

      1. Join reordering using star schema detection
      2. New selectivity hint to specify the selectivity of the predicates over base tables. Selectivity hint is optional and it was not used in the above TPC-DS tests.

      Attachments

        1. StarSchemaJoinReordering.pptx
          81 kB
          Ioana Delaney

        Issue Links

          Activity

            People

              ioana-delaney Ioana Delaney
              ioana-delaney Ioana Delaney
              Votes:
              1 Vote for this issue
              Watchers:
              29 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: