[SPARK-17626] TPC-DS performance improvements using star-schema heuristics - ASF JIRA

Details

Type: Umbrella
Status: Resolved
Priority: Critical
Resolution: Done
Affects Version/s: 2.1.0
Fix Version/s: 2.2.0
Component/s: SQL
Labels: None
Target Version/s: 2.2.0

Description

TPC-DS performance improvements using star-schema heuristics

TPC-DS consists of multiple snowflake schemas, which are star schemas whose dimensions link to further dimensions. A star schema consists of a fact table referencing a number of dimension tables. The fact table holds the main data about a business. A dimension table, usually a smaller table, describes a dimension or attribute of the business.
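To make the fact/dimension terminology concrete, here is a minimal sketch of a star join using Python's built-in sqlite3 module. The table and column names are modeled on TPC-DS, but the rows are made up for illustration and the example says nothing about how Spark plans the query.

```python
import sqlite3

# Miniature star schema: one fact table (store_sales) referencing
# two dimension tables (date_dim, item). All rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (d_date_sk INTEGER PRIMARY KEY, d_year INTEGER);
CREATE TABLE item (i_item_sk INTEGER PRIMARY KEY, i_category TEXT);
CREATE TABLE store_sales (
    ss_sold_date_sk INTEGER REFERENCES date_dim(d_date_sk),
    ss_item_sk INTEGER REFERENCES item(i_item_sk),
    ss_net_paid REAL
);
INSERT INTO date_dim VALUES (1, 2000), (2, 2001);
INSERT INTO item VALUES (10, 'Books'), (11, 'Music');
INSERT INTO store_sales VALUES
    (1, 10, 5.0), (1, 11, 7.5), (2, 10, 3.0), (2, 11, 4.0);
""")

# A typical star join: the fact table joins each dimension on its key,
# and a selective predicate sits on a dimension (d_year = 2000).
STAR_QUERY = """
SELECT i.i_category, SUM(ss.ss_net_paid) AS total_paid
FROM store_sales ss
JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
JOIN item i ON ss.ss_item_sk = i.i_item_sk
WHERE d.d_year = 2000
GROUP BY i.i_category
ORDER BY i.i_category
"""
rows = conn.execute(STAR_QUERY).fetchall()
print(rows)
```

The filter on date_dim is what makes the fact-dimension join selective: joining store_sales with the filtered dimension early discards most fact rows before any further joins.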

As part of the benchmark performance investigation, we observed a pattern of sub-optimal execution plans for joins over large fact tables. Manually rewriting some of the queries into selective fact-dimension joins resulted in significant performance improvements. This prompted us to develop a simple join reordering algorithm based on star schema detection. Performance testing with a 1TB TPC-DS workload shows an overall improvement of 19%.

Summary of the results:

Passed                    99
Failed                     0
Total query time (s)  14,962
Max time (s)           1,467
Min time (s)               3
Mean time (s)            145
Geomean (s)               44

Compared to baseline (negative = improvement; positive = degradation):

End to end improved (%)              -19%
Mean time improved (%)               -19%
Geomean improved (%)                 -24%
End to end improved (seconds)      -3,603
Number of queries improved (>10%)      45
Number of queries degraded (>10%)       6
Number of queries unchanged            48
Top 10 queries improved (%)          -20%
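
As a quick consistency check on the two tables above, the end-to-end percentage can be recomputed from the reported totals. This sketch assumes that "End to end improved (seconds)" is the difference between the new total query time and the baseline total, so the baseline figure below is inferred rather than reported.

```python
# Figures taken from the summary tables above.
total_s = 14_962   # total query time with the change (s)
delta_s = -3_603   # end-to-end change versus baseline (s)

# Assumption: baseline total = new total - delta (inferred, not reported).
baseline_s = total_s - delta_s
pct_change = 100 * delta_s / baseline_s

# The three query buckets should account for every passed query.
improved, degraded, unchanged = 45, 6, 48
bucket_total = improved + degraded + unchanged

print(f"inferred baseline: {baseline_s} s")
print(f"end-to-end change: {pct_change:.1f}%")
print(f"queries accounted for: {bucket_total}")
```

Under that assumption the inferred baseline is 18,565 seconds, the change rounds to the reported -19%, and the three buckets sum to the 99 passed queries.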

Cluster: 20-node cluster with each node having:

Ten 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors, 128 GB RAM, 10 Gigabit Ethernet.
Total memory for the cluster: 2.5TB
Total storage: 400TB
Total CPU cores: 480

Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA

Database info:

Schema: TPCDS
Scale factor: 1TB total space
Storage format: Parquet with Snappy compression

Our investigation and results are included in the attached document.

There are two parts to this improvement:

1.	Join reordering using star schema detection
2.	New selectivity hint to specify the selectivity of the predicates over base tables. The selectivity hint is optional and was not used in the TPC-DS tests above.
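
The general idea of reordering joins around a detected star can be sketched as follows. This is a hypothetical illustration, not the algorithm implemented in Spark or described in the attached document: it assumes the largest table is the fact table, treats every table joined directly to it as a dimension, and joins the most selective dimensions first. The `Table` class, the `reorder_star_join` function, the row counts, and the selectivity values are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    rows: int                 # illustrative base-table row count
    selectivity: float = 1.0  # assumed fraction of rows kept by local predicates


def reorder_star_join(tables, joins):
    """Return a join order: fact table, then dimensions by selectivity.

    tables: list of Table
    joins:  set of frozenset({name_a, name_b}) join-predicate pairs
    """
    # Heuristic assumption: the largest table is the fact table.
    fact = max(tables, key=lambda t: t.rows)
    # Dimensions are the tables joined directly to the fact table.
    dims = [t for t in tables
            if t is not fact and frozenset({fact.name, t.name}) in joins]
    # Anything else keeps its original relative position at the end.
    others = [t for t in tables if t is not fact and t not in dims]
    # Most selective dimension first (smallest surviving fraction).
    dims.sort(key=lambda t: t.selectivity)
    return [fact.name] + [t.name for t in dims] + [t.name for t in others]


tables = [
    Table("store", 1_000),
    Table("item", 300_000, selectivity=0.1),
    Table("store_sales", 1_000_000_000),
    Table("date_dim", 70_000, selectivity=0.005),
]
joins = {
    frozenset({"store_sales", "date_dim"}),
    frozenset({"store_sales", "item"}),
    frozenset({"store_sales", "store"}),
}
order = reorder_star_join(tables, joins)
print(order)
```

Placing the most selective dimension joins next to the fact table shrinks the intermediate result as early as possible, which mirrors the manual rewrite into selective fact-dimension joins mentioned in the description.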

Attachments

StarSchemaJoinReordering.pptx (81 kB), attached 21/Sep/16 18:17 by Ioana Delaney

Issue Links

is duplicated by: SPARK-17375 Star Join Optimization (Closed)

Sub-Tasks

1.	Join reordering using star schema detection	Resolved	Ioana Delaney
2.	Refactor star schema code for the subsequent star join detection in CBO	Resolved	Ioana Delaney
3.	Apply star-join filter heuristics to dynamic programming join enumeration	Resolved	Ioana Delaney

People

Assignee: Ioana Delaney

Reporter: Ioana Delaney

Votes: 1

Watchers: 29

Dates

Created: 21/Sep/16 18:10

Updated: 08/Jan/18 20:41

Resolved: 08/Jan/18 20:41