Description
UPDATE: Please start with this comment:
https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785
I assume the problem results from a performance issue with reading parquet files.
Original Issue description
I ran some tests on a parquet file with many nested columns (about 30 GB in 400 partitions), and Spark 2.0 is about 2x slower.
df = sqlctx.read.parquet(path)
df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 100000 else []).collect()
Spark 1.6 -> 2.3 min
Spark 2.0 -> 4.6 min (2x slower)
I used the BasicProfiler for this task, and the cumulative time was:
Spark 1.6 - 4300 sec
Spark 2.0 - 5800 sec
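For reference, a minimal sketch of how the timings were collected, assuming the standard spark.python.profile setting is used to enable PySpark's BasicProfiler (the default profiler); 'path' and 'some_id' are placeholders for the real dataset and filter value:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Enable the Python-side profiler (BasicProfiler is the default implementation).
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(conf=conf)
sqlctx = SQLContext(sc)

# Same job as above: read the nested-column parquet data and run the flatMap.
df = sqlctx.read.parquet(path)
df.where('id > some_id').rdd \
  .flatMap(lambda r: [r.id] if not r.id % 100000 else []) \
  .collect()

# Print the cumulative profile gathered for the RDD operations.
sc.show_profiles()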
Should I expect such a drop in performance?
I don't know how to prepare sample data that reproduces the problem.
Any ideas? Or is there a public dataset with many nested columns? (A rough sketch of one way to synthesize such data follows.)
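A hedged sketch of one way to generate parquet data with many nested columns; the column names, field counts, row count, and output path are made up for illustration and are not the original 30 GB dataset:

from pyspark.sql import functions as F

# Start from a simple range of ids.
df = sqlctx.range(0, 10 * 1000 * 1000)

# Add ~30 nested struct columns, each with a few leaf fields.
for i in range(30):
    df = df.withColumn(
        'nested_%d' % i,
        F.struct(
            (F.col('id') + i).alias('a'),
            (F.col('id') * 2).alias('b'),
            F.lit('x' * 20).alias('c'),
        ))

# Write out with many partitions, roughly mimicking the original layout.
df.repartition(400).write.parquet('/tmp/nested_parquet_sample')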
Attachments
Issue Links
- relates to SPARK-16320 "Document G1 heap region's effect on spark 2.0 vs 1.6" (Resolved)
- links to