Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Description
Running the following code to store the data for each year and pos in a separate folder for a very large DataFrame is taking a huge amount of time (more than 37 hours for about 60% of the work).
## IPYTHON was started using the following command:
# IPYTHON=1 "$SPARK_HOME/bin/pyspark" --driver-memory 50g

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import *

conf = SparkConf()
conf.setMaster("local[30]")
conf.setAppName("analysis")
conf.set("spark.local.dir", "./tmp")
conf.set("spark.executor.memory", "50g")
conf.set("spark.driver.maxResultSize", "5g")

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.read.format("csv").options(header=False, inferschema=True, delimiter="\t").load("out/new_features")

# Rename the raw columns; `columns` is a list of the 20 column names (defined elsewhere)
df = df.selectExpr(*("%s as %s" % (df.columns[i], k) for i, k in enumerate(columns)))

# year can take values from [1902, 2015]
# pos takes integer values from [-1, 0, 1, 2]
# df is a dataframe with 20 columns and 1 billion rows
# Running on a machine with 32 cores and 500 GB RAM
df.write.save("out/model_input_partitioned", format="csv",
              partitionBy=["year", "pos"], delimiter="\t")
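(For context: a commonly suggested variant of this kind of write is to repartition by the partition columns before writing, so that all rows for a given (year, pos) land in the same task and each task writes only a few of the year/pos directories instead of potentially all of them. The sketch below is not from the original report; it reuses the `df`, path, and options from the snippet above and assumes Spark 1.6+, where DataFrame.repartition accepts column names.)

# Sketch only, not the reporter's code: shuffle rows so each (year, pos)
# combination is handled by a single task before the partitioned write.
df_by_part = df.repartition("year", "pos")

df_by_part.write.save("out/model_input_partitioned", format="csv",
                      partitionBy=["year", "pos"], delimiter="\t")

This trades one extra shuffle for far fewer output files open per task during the save stage.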
Currently, the code is at:
[Stage 12:==============================> (1367 + 30) / 2290]
And it has already been running for more than 37 hours. For comparison, a single filter-by-value sweep over the same data takes less than 6.5 minutes.
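(The exact query behind that 6.5-minute figure is not given in the report; such a sweep would look roughly like the following, with the column and value chosen purely for illustration.)

# Hypothetical filter-by-value sweep; the actual column and value used for
# the 6.5-minute measurement are not stated in the report.
df.filter(df.year == 2000).count()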
The Spark web UI shows the following rows for the two stages of the job:
Stage  Description                                Submitted            Duration  Tasks: Succeeded/Total  Input    Output  Shuffle Read  Shuffle Write
11     load at NativeMethodAccessorImpl.java:-2   2016/02/27 23:07:04  6.5 min   2290/2290               66.8 GB
12     save at NativeMethodAccessorImpl.java:-2   2016/02/27 23:15:59  37.1 h    1370/2290               40.9 GB