[SPARK-23309] Spark 2.3 cached query performance 20-30% worse then spark 2.2 - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Works for Me
Affects Version/s: 2.3.0
Fix Version/s: None
Component/s: SQL
Labels:
None

Description

I was testing spark 2.3 rc2 and I am seeing a performance regression in sql queries on cached data.

The size of the data: 10.4GB input from hive orc files /18.8 GB cached/5592 partitions

Here is the example query:

val dailycached = spark.sql("select something from table where dt = '20170301' AND something IS NOT NULL")

dailycached.createOrReplaceTempView("dailycached") spark.catalog.cacheTable("dailyCached")

spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()

On spark 2.2 I see queries times average 13 seconds

On the same nodes I see spark 2.3 queries times average 17 seconds

Note these are times of queries after the initial caching. so just running the last line again:

spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() multiple times.

I also ran a query over more data (335GB input/587.5 GB cached) and saw a similar discrepancy in the performance of querying cached data between spark 2.3 and spark 2.2, where 2.2 was better by like 20%.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Thomas Graves

Votes:: 0 Vote for this issue

Watchers:: 22 Start watching this issue

Dates

Created:: 01/Feb/18 22:44

Updated:: 29/Jun/18 23:46

Resolved:: 29/Jun/18 15:57