Description
Executing a SQL statement with a large number of partitions requires a large amount of driver memory, even when no data is collected back to the driver.
Here are the steps to reproduce the issue.
1. Start the Spark shell with a spark.driver.maxResultSize setting:
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
2. Execute the following code:
case class Toto(a: Int, b: Int)
val df = sc.parallelize(1 to 1e6.toInt).map(i => Toto(i, i)).toDF
sqlContext.setConf("spark.sql.shuffle.partitions", "200")
df.groupBy("a").count().saveAsParquetFile("toto1") // OK
sqlContext.setConf("spark.sql.shuffle.partitions", 1e3.toInt.toString)
df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile("toto2") // ERROR
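For reference, a rough equivalent of the same steps on the newer DataFrame API (a sketch only, assuming Spark 2.x with a SparkSession named spark; the output paths "toto1"/"toto2" are kept from above) would be:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("maxResultSize-repro").getOrCreate()

// 1,000,000 rows with two identical columns, mirroring Toto(i, i) above
val df = spark.range(1000000L).selectExpr("id as a", "id as b")

spark.conf.set("spark.sql.shuffle.partitions", "200")
df.groupBy("a").count().write.parquet("toto1")  // OK: 200 result partitions

spark.conf.set("spark.sql.shuffle.partitions", "1000")
df.repartition(1000).groupBy("a").count()
  .repartition(1000).write.parquet("toto2")     // fails once the accumulated task results exceed the limit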
The error message is:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 393 tasks (1025.9 KB) is bigger than spark.driver.maxResultSize (1024.0 KB)
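As a workaround (it does not fix the underlying driver memory use), the limit can be raised or disabled at submit time; spark.driver.maxResultSize accepts a size string or 0 for unlimited, though disabling it without enough driver memory risks an OutOfMemoryError on the driver. For example:

bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=2g
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=0   # 0 = unlimited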
Issue Links
- duplicates SPARK-14226: Caching a table with 1,100 columns and a few million rows fails (Closed)