Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-23309

Spark 2.3 cached query performance 20-30% worse then spark 2.2

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Works for Me
    • 2.3.0
    • None
    • SQL
    • None

    Description

      I was testing spark 2.3 rc2 and I am seeing a performance regression in sql queries on cached data.

      The size of the data: 10.4GB input from hive orc files /18.8 GB cached/5592 partitions

      Here is the example query:

      val dailycached = spark.sql("select something from table where dt = '20170301' AND something IS NOT NULL")

      dailycached.createOrReplaceTempView("dailycached") spark.catalog.cacheTable("dailyCached")

      spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()

       

      On spark 2.2 I see queries times average 13 seconds

      On the same nodes I see spark 2.3 queries times average 17 seconds

      Note these are times of queries after the initial caching.  so just running the last line again: 

      spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() multiple times.

       

      I also ran a query over more data (335GB input/587.5 GB cached) and saw a similar discrepancy in the performance of querying cached data between spark 2.3 and spark 2.2, where 2.2 was better by like 20%.  

      Attachments

        Activity

          People

            Unassigned Unassigned
            tgraves Thomas Graves
            Votes:
            0 Vote for this issue
            Watchers:
            22 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: