Spark / SPARK-5180 Data source API improvement (Spark 1.5) / SPARK-8597

DataFrame partitionBy memory pressure scales extremely poorly


Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None
    • Epic Link: Spark 1.5 release

    Description

      I'm running into a strange memory scaling issue when using the partitionBy feature of DataFrameWriter.

      I've generated a table (a CSV file) with 3 columns (A, B and C) and 32*32 rows, about 20 kB on disk. There are 32 distinct values for column A and 32 distinct values for column B, and every combination of the two appears exactly once (column C holds a random number for each row; its contents don't matter). I imported this into Spark and ran partitionBy("A", "B") on the writer to test its performance. This should create a nested directory structure with 32 folders, each containing another 32 folders. The write uses about 10 GB of RAM and runs slowly. If I increase the number of rows from 32*32 to 128*128, I get a Java heap space OutOfMemoryError no matter how much driver memory I allocate.

      Scala code:

      // Load the CSV with the spark-csv package, then write it back partitioned by A and B
      val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("table.csv")
      df.write.partitionBy("A", "B").mode("overwrite").parquet("table.parquet")
      

      How I ran the Spark shell:

      bin/spark-shell --driver-memory 16g --master local[8] --packages com.databricks:spark-csv_2.10:1.0.3
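
      The same data set can also be generated in-memory, without table.csv or the spark-csv package. This is a minimal sketch, assuming the Spark 1.4 shell's built-in sqlContext; the column names match the report, and the actual values in A, B and C are arbitrary:

      // Build every combination of 32 A-values and 32 B-values (32*32 = 1024 rows),
      // put a random number in column C, then write it partitioned by A and B.
      import scala.util.Random
      import sqlContext.implicits._

      val rows = for (a <- 1 to 32; b <- 1 to 32) yield (s"a$a", s"b$b", Random.nextDouble())
      val df = rows.toDF("A", "B", "C")
      df.write.partitionBy("A", "B").mode("overwrite").parquet("table.parquet")

      Changing both ranges from 1 to 32 to 1 to 128 gives the 128*128 case that runs out of heap.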
      

      Attachments

        1. table.csv (17 kB, attached by Matt Cheah)


            People

              Assignee: Unassigned
              Reporter: Matt Cheah (mcheah)
              Votes: 0
              Watchers: 8
