Description
For this Parquet reading benchmark (scanning a table with a nested column and filtering on the nested field), Spark 2.0 is 20%–30% slower than Spark 1.6.
// Test env: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, Intel SSD SC2KW24
// Run in spark-shell; outside the shell these imports are needed:
// import org.apache.spark.sql.functions.struct
// import spark.implicits._

// Generate a Parquet table with a nested column
spark.range(100000000).select(struct($"id").as("nc")).write.parquet("/tmp/data4")

// Time a block and return the elapsed milliseconds
def time[R](block: => R): Long = {
  val t0 = System.nanoTime()
  val result = block // call-by-name; the result itself is discarded
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) / 1000000 + "ms")
  (t1 - t0) / 1000000
}

// Average elapsed time (ms) over 20 runs of reading the table and filtering on the nested field
val x = ((0 until 20).toList.map(x =>
  time(spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()))).sum / 20
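As a follow-up sketch (not part of the original report), a flat-schema variant of the same scan can help isolate whether the regression is specific to nested columns; Spark 2.0's vectorized Parquet reader is generally understood to handle only flat schemas and to fall back to a slower row-based path for nested ones. The path /tmp/data4_flat, the column name id2, and the variable flatAvg below are hypothetical names introduced for illustration.

// Hypothetical flat-schema control (assumption, not from the original report):
// if this scan shows no 1.6-vs-2.0 regression, the slowdown is specific to
// the nested-column read path.
spark.range(100000000).select($"id".as("id2")).write.parquet("/tmp/data4_flat")

// Reuses time() from above; average elapsed time (ms) over 20 runs
val flatAvg = ((0 until 20).toList.map(_ =>
  time(spark.read.parquet("/tmp/data4_flat").filter($"id2" < 100).collect()))).sum / 20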
Issue Links
- relates to: SPARK-16320 "Document G1 heap region's effect on spark 2.0 vs 1.6" (Resolved)