SPARK-12837: Spark driver requires large memory space for serialized results even when no data is collected to the driver


Details

    • Type: Question
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.2, 1.6.0
    • Fix Version/s: 2.2.0
    • Component/s: SQL
    • Labels: None

    Description

      Executing a SQL statement over a large number of partitions requires a large amount of driver memory, even when no data is collected back to the driver. The limit is apparently hit because the serialized result of every task (task metadata such as metrics and accumulator updates) counts against spark.driver.maxResultSize, so a job with enough tasks exceeds the limit even though no rows are returned.

      Here are the steps to reproduce the issue.
      1. Start the Spark shell with a small spark.driver.maxResultSize setting

      bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
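
      For a standalone application instead of the shell, the same limit can be set programmatically. A minimal sketch against the Spark 1.x API used in this report (the app name is illustrative):

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      // Same limit as the --conf flag above, set on the SparkConf instead
      val conf = new SparkConf()
        .setAppName("max-result-size-repro") // illustrative name
        .set("spark.driver.maxResultSize", "1m")
      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)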
      

      2. Execute the following code

      case class Toto(a: Int, b: Int)
      val df = sc.parallelize(1 to 1e6.toInt).map(i => Toto(i, i)).toDF

      sqlContext.setConf("spark.sql.shuffle.partitions", "200")
      df.groupBy("a").count().saveAsParquetFile("toto1") // OK

      sqlContext.setConf("spark.sql.shuffle.partitions", 1e3.toInt.toString)
      df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile("toto2") // ERROR
      

      The error message is

      Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 393 tasks (1025.9 KB) is bigger than spark.driver.maxResultSize (1024.0 KB)
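
      The 393 task results in the message average roughly 2.6 KB each, so the per-task result metadata alone exceeds the 1 MB limit well before the 1,000-partition shuffles in the failing job complete, even though no rows are collected. As a workaround (not a fix), the limit can be raised, or disabled by setting it to 0; for example, restarting the shell with a larger value (128m here is illustrative):

      bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=128m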
      

            People

              Assignee: Wenchen Fan (cloud_fan)
              Reporter: Tien-Dung LE (tien-dung.le)