Details
- Bug
- Status: Resolved
- Major
- Resolution: Works for Me
- 2.3.0
- None
- None
Description
I was testing Spark 2.3 RC2 and I am seeing a performance regression in SQL queries on cached data.
Data size: 10.4 GB of input from Hive ORC files, 18.8 GB cached, 5592 partitions.
Here is the example query:
val dailycached = spark.sql("select something from table where dt = '20170301' AND something IS NOT NULL")
dailycached.createOrReplaceTempView("dailycached")
spark.catalog.cacheTable("dailyCached")
spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
On Spark 2.2, query times average 13 seconds.
On the same nodes with Spark 2.3, query times average 17 seconds.
Note that these are the times for queries after the initial caching, i.e. from running only the last line again multiple times:
spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
I also ran a query over more data (335 GB input, 587.5 GB cached) and saw a similar discrepancy in the performance of querying cached data between Spark 2.3 and Spark 2.2, with 2.2 faster by about 20%.
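For anyone trying to reproduce the warm-cache numbers, here is a minimal sketch of how the repeated runs could be timed from spark-shell. The helpers `timeRuns` and `mean` are hypothetical names introduced only for illustration; they are not part of Spark and not necessarily how the timings in this report were collected.

```scala
// Hypothetical helper (not part of Spark): evaluates `body` `runs` times and
// returns the wall-clock duration of each run in milliseconds.
def timeRuns[A](runs: Int)(body: => A): Seq[Double] =
  (1 to runs).map { _ =>
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1e6
  }

// Arithmetic mean of the collected durations.
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

// Assumed usage in spark-shell, after the table has been cached and the
// first query has already materialized the cache:
//   val times = timeRuns(10) {
//     spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
//   }
//   println(f"average: ${mean(times) / 1000}%.1f s")
```

Running the same loop on both versions on the same nodes keeps the comparison limited to queries against already cached data, as described above.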