Description
UPDATE: Please start with this comment:
https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785
I assume the problem results from a performance issue with reading parquet files.
Original Issue description
I ran some tests on a parquet file with many nested columns (about 30 GB in 400 partitions), and Spark 2.0 is about 2x slower.
df = sqlctx.read.parquet(path)
df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 100000 else []).collect()
Spark 1.6 -> 2.3 min
Spark 2.0 -> 4.6 min (2x slower)
I used the BasicProfiler for this task, and the cumulative time was:
Spark 1.6 - 4300 sec
Spark 2.0 - 5800 sec
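For reference, a minimal sketch of how the timings were collected, assuming the standard spark.python.profile setting is used to enable PySpark's BasicProfiler (the default profiler); 'path' and 'some_id' are placeholders for the real dataset and filter value:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Enable the Python-side profiler (BasicProfiler is the default implementation).
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(conf=conf)
sqlctx = SQLContext(sc)

# Same job as above: read the nested-column parquet data and run the flatMap.
df = sqlctx.read.parquet(path)
df.where('id > some_id').rdd \
  .flatMap(lambda r: [r.id] if not r.id % 100000 else []) \
  .collect()

# Print the cumulative profile gathered for the RDD operations.
sc.show_profiles()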
Should I expect such a drop in performance?
I don't know how to prepare sample data that reproduces the problem.
Any ideas? Or is there a public dataset with many nested columns? (A rough sketch of one way to synthesize such data follows.)
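A hedged sketch of one way to generate parquet data with many nested columns; the column names, field counts, row count, and output path are made up for illustration and are not the original 30 GB dataset:

from pyspark.sql import functions as F

# Start from a simple range of ids.
df = sqlctx.range(0, 10 * 1000 * 1000)

# Add ~30 nested struct columns, each with a few leaf fields.
for i in range(30):
    df = df.withColumn(
        'nested_%d' % i,
        F.struct(
            (F.col('id') + i).alias('a'),
            (F.col('id') * 2).alias('b'),
            F.lit('x' * 20).alias('c'),
        ))

# Write out with many partitions, roughly mimicking the original layout.
df.repartition(400).write.parquet('/tmp/nested_parquet_sample')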
Attachments
Issue Links
- relates to SPARK-16320 "Document G1 heap region's effect on spark 2.0 vs 1.6" (Resolved)
- links to