Spark / SPARK-16320

Document G1 heap region's effect on Spark 2.0 vs 1.6


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: Documentation, SQL
    • Labels: None

    Description

      I did some tests on a parquet file with many nested columns (about 30 GB in
      400 partitions), and Spark 2.0 is sometimes slower.

      I tested the following queries:
      1)

      select count(*) where id > some_id

      For this query, performance is similar (about 1 second).

      2)

      select count(*) where nested_column.id > some_id

      Spark 1.6 -> 1.6 min
      Spark 2.0 -> 2.1 min
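
      A minimal sketch of how such queries could be run with Spark SQL (assuming an
      existing HiveContext sqlctx, as in the script further below; the table name,
      column names, and the some_id value are placeholders, not the real schema):

      # Hypothetical sketch; 'nested_table', 'id', 'nested_column.id' and some_id
      # are placeholder names for illustration only.
      df = sqlctx.read.parquet('/path/to/nested/parquet')
      df.registerTempTable('nested_table')
      some_id = 1000
      # 1) filter on a flat, top-level column
      sqlctx.sql("select count(*) from nested_table where id > {}".format(some_id)).collect()
      # 2) filter on a field inside a nested struct column
      sqlctx.sql("select count(*) from nested_table where nested_column.id > {}".format(some_id)).collect()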

      Should I expect such a drop in performance?

      I don't know how to prepare sample data to show the problem.
      Any ideas? Or is there public data with many nested columns?

      UPDATE
      I created a script to generate the data and confirm the problem.

      #Initialization
      from pyspark import SparkContext, SparkConf
      from pyspark.sql import HiveContext
      from pyspark.sql.functions import struct

      conf = SparkConf()
      conf.set('spark.cores.max', '15')
      conf.set('spark.executor.memory', '30g')
      conf.set('spark.driver.memory', '30g')
      sc = SparkContext(conf=conf)
      sqlctx = HiveContext(sc)

      #Data creation
      MAX_SIZE = 2**32 - 1

      path = '/mnt/mfs/parquet_nested'

      def create_sample_data(levels, rows, path):

          # One row: a dict of `cols` random integer columns.
          def _create_column_data(cols):
              import random
              random.seed()
              return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}

          # A flat DataFrame with `cols` integer columns and `rows` rows.
          def _create_sample_df(cols, rows):
              rdd = sc.parallelize(range(rows))
              data = rdd.map(lambda r: _create_column_data(cols))
              df = sqlctx.createDataFrame(data)
              return df

          # Recursively wrap the flat DataFrame in struct columns,
          # one nesting level per entry in `levels`.
          def _create_nested_data(levels, rows):
              if len(levels) == 1:
                  return _create_sample_df(levels[0], rows).cache()
              else:
                  df = _create_nested_data(levels[1:], rows)
                  return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])

          df = _create_nested_data(levels, rows)
          df.write.mode('overwrite').parquet(path)

      #Sample data: 2 top-level structs, each holding 10 structs of 200 integer columns,
      #1,000,000 rows
      create_sample_data([2, 10, 200], 1000000, path)

      #Query (timed with the IPython %%timeit cell magic)
      df = sqlctx.read.parquet(path)

      %%timeit
      df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()


      Results
      Spark 1.6
      1 loop, best of 3: 1min 5s per loop
      Spark 2.0
      1 loop, best of 3: 1min 21s per loop
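
      The timings above come from the IPython %%timeit cell magic; outside a notebook,
      a plain-Python measurement along these lines could be used instead:

      # Plain-Python alternative to %%timeit (an assumed variant, not from the
      # original script): time a single run of the same nested-column query.
      import time
      start = time.time()
      df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
      print("query took {:.1f} s".format(time.time() - start))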

      UPDATE 2
      Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the same root cause.
      I attached some VisualVM profiles there.
      The most interesting ones are from the queries:
      https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
      https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps
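
      Since this ticket tracks documenting the G1 heap region size's effect, a minimal
      sketch of how those GC settings could be passed to the executors is shown below;
      the flags and the 16m value are illustrative assumptions, not a recommendation
      taken from this ticket.

      # Sketch (assumption): enable G1 GC with an explicit heap region size on the
      # executors. In G1, objects larger than half a region are treated as humongous
      # allocations, so a larger region size can help when tasks materialize large
      # objects (e.g. wide/nested parquet rows); 16m is only an illustrative value.
      from pyspark import SparkConf, SparkContext
      conf = SparkConf()
      conf.set('spark.executor.extraJavaOptions',
               '-XX:+UseG1GC -XX:G1HeapRegionSize=16m')
      sc = SparkContext(conf=conf)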

      Attachments

        1. spark2-ui.png
          103 kB
          Maciej Bryński
        2. spark1.6-ui.png
          100 kB
          Maciej Bryński


      People

        Assignee: srowen Sean R. Owen
        Reporter: maver1ck Maciej Bryński
        Votes: 1
        Watchers: 15
