TPC-DS performance improvements using star-schema heuristics
TPC-DS consists of multiple snowflake schema, which are multiple star schema with dimensions linking to dimensions. A star schema consists of a fact table referencing a number of dimension tables. Fact table holds the main data about a business. Dimension table, a usually smaller table, describes data reflecting the dimension/attribute of a business.
As part of the benchmark performance investigation, we observed a pattern of sub-optimal execution plans of large fact tables joins. Manual rewrite of some of the queries into selective fact-dimensions joins resulted in significant performance improvement. This prompted us to develop a simple join reordering algorithm based on star schema detection. The performance testing using 1TB TPC-DS workload shows an overall improvement of 19%.
Summary of the results:
Compared to baseline (Negative = improvement; Positive = Degradation):
Cluster: 20-node cluster with each node having:
- 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
- Total memory for the cluster: 2.5TB
- Total storage: 400TB
- Total CPU cores: 480
Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
- Schema: TPCDS
- Scale factor: 1TB total space
- Storage format: Parquet with Snappy compression
Our investigation and results are included in the attached document.
There are two parts to this improvement:
- Join reordering using star schema detection
- New selectivity hint to specify the selectivity of the predicates over base tables. Selectivity hint is optional and it was not used in the above TPC-DS tests.