Spark / SPARK-16907

Parquet table reading performance regression when vectorized record reader is not used

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

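      For the Parquet reading benchmark below, Spark 2.0 is 20%-30% slower than Spark 1.6. The benchmark reads a table with a nested column, a case in which the vectorized record reader is not used.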

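      // Test Env: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, Intel SSD SC2KW24
      // Run in spark-shell, where `spark` and spark.implicits._ (for the $"..." syntax) are already in scope.
      import org.apache.spark.sql.functions.struct
      
      // Generates a Parquet table with a nested column: struct `nc` containing field `id`
      spark.range(100000000).select(struct($"id").as("nc")).write.parquet("/tmp/data4")
      
      // Evaluates `block` once and returns the elapsed wall-clock time in milliseconds
      def time[R](block: => R): Long = {
          val t0 = System.nanoTime()
          val result = block    // call-by-name: the block is evaluated here
          val t1 = System.nanoTime()
          val elapsedMs = (t1 - t0) / 1000000
          println("Elapsed time: " + elapsedMs + "ms")
          elapsedMs
      }
      
      // Average elapsed time (ms) over 20 runs of the filtered read
      val avgMs = (0 until 20).map(_ => time(spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect())).sum / 20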
      

            People

              Assignee: Sean Zhong (clockfly)
              Reporter: Sean Zhong (clockfly)
              Votes: 0
              Watchers: 3
