SPARK-12837: Spark driver requires large memory space for serialized results even when no data is collected to the driver


Details

    • Type: Question
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.2, 1.6.0
    • Fix Version/s: 2.2.0
    • Component/s: SQL
    • Labels: None

    Description

      Executing a SQL statement over a large number of partitions requires a large amount of driver memory, even when no data is collected back to the driver. The limit is apparently hit because the serialized result of every task (task metadata such as metrics and accumulator updates) counts against spark.driver.maxResultSize, so a job with enough tasks exceeds the limit even though no rows are returned.

      Here are the steps to reproduce the issue.
      1. Start the Spark shell with a small spark.driver.maxResultSize setting

      bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
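
      For a standalone application instead of the shell, the same limit can be set programmatically. A minimal sketch against the Spark 1.x API used in this report (the app name is illustrative):

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      // Same limit as the --conf flag above, set on the SparkConf instead
      val conf = new SparkConf()
        .setAppName("max-result-size-repro") // illustrative name
        .set("spark.driver.maxResultSize", "1m")
      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)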
      

      2. Execute the following code

      case class Toto(a: Int, b: Int)
      val df = sc.parallelize(1 to 1e6.toInt).map(i => Toto(i, i)).toDF

      sqlContext.setConf("spark.sql.shuffle.partitions", "200")
      df.groupBy("a").count().saveAsParquetFile("toto1") // OK

      sqlContext.setConf("spark.sql.shuffle.partitions", 1e3.toInt.toString)
      df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile("toto2") // ERROR
      

      The error message is

      Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 393 tasks (1025.9 KB) is bigger than spark.driver.maxResultSize (1024.0 KB)
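
      The 393 task results in the message average roughly 2.6 KB each, so the per-task result metadata alone exceeds the 1 MB limit well before the 1,000-partition shuffles in the failing job complete, even though no rows are collected. As a workaround (not a fix), the limit can be raised, or disabled by setting it to 0; for example, restarting the shell with a larger value (128m here is illustrative):

      bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=128m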
      

            People

              Assignee: Wenchen Fan (cloud_fan)
              Reporter: Tien-Dung LE (tien-dung.le)