Description
TPC-DS performance improvements using star-schema heuristics
TPC-DS consists of multiple snowflake schema, which are multiple star schema with dimensions linking to dimensions. A star schema consists of a fact table referencing a number of dimension tables. Fact table holds the main data about a business. Dimension table, a usually smaller table, describes data reflecting the dimension/attribute of a business.
As part of the benchmark performance investigation, we observed a pattern of sub-optimal execution plans of large fact tables joins. Manual rewrite of some of the queries into selective fact-dimensions joins resulted in significant performance improvement. This prompted us to develop a simple join reordering algorithm based on star schema detection. The performance testing using 1TB TPC-DS workload shows an overall improvement of 19%.
Summary of the results:
Passed 99 Failed 0 Total q time (s) 14,962 Max time 1,467 Min time 3 Mean time 145 Geomean 44
Compared to baseline (Negative = improvement; Positive = Degradation):
End to end improved (%) -19% Mean time improved (%) -19% Geomean improved (%) -24% End to end improved (seconds) -3,603 Number of queries improved (>10%) 45 Number of queries degraded (>10%) 6 Number of queries unchanged 48 Top 10 queries improved (%) -20%
Cluster: 20-node cluster with each node having:
- 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
- Total memory for the cluster: 2.5TB
- Total storage: 400TB
- Total CPU cores: 480
Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
Database info:
- Schema: TPCDS
- Scale factor: 1TB total space
- Storage format: Parquet with Snappy compression
Our investigation and results are included in the attached document.
There are two parts to this improvement:
- Join reordering using star schema detection
- New selectivity hint to specify the selectivity of the predicates over base tables. Selectivity hint is optional and it was not used in the above TPC-DS tests.
Attachments
Attachments
Issue Links
- is duplicated by
-
SPARK-17375 Star Join Optimization
- Closed