[SPARK-31635] Spark SQL Sort fails when sorting big data points


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.2
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

      Description

       Please have a look at the example below: 

      case class Point(x:Double, y:Double)
      case class Nested(a: Long, b: Seq[Point])
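      // 100 records, each carrying 250,000 Points (2 doubles each, ~4 MB of raw payload per record)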
      val test = spark.sparkContext.parallelize((1L to 100L).map(a => Nested(a,Seq.fill[Point](250000)(Point(1,2)))), 100)
      test.toDF().as[Nested].sort("a").take(1)
      

       Sorting big data objects using the Spark DataFrame API fails with the following exception:

      2020-05-04 08:01:00 ERROR TaskSetManager:70 - Total size of serialized results of 14 tasks (107.8 MB) is bigger than spark.driver.maxResultSize (100.0 MB)
      [Stage 0:======>                                                 (12 + 3) / 100]org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 13 tasks (100.1 MB) is bigger than spark.driver.maxResu
      

      However, using the RDD API works and no exception is thrown:

      case class Point(x:Double, y:Double)
      case class Nested(a: Long, b: Seq[Point])
      val test = spark.sparkContext.parallelize((1L to 100L).map(a => Nested(a,Seq.fill[Point](250000)(Point(1,2)))), 100)
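      // identical data, but sorted through the RDD API instead of the Dataset API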
      test.sortBy(_.a).take(1)
      

      For both code snippets we started spark-shell with exactly the same arguments:

      spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=100MB"
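
      For a standalone application the same limit can also be set when building the session. A minimal sketch (the app name is just an illustration; --driver-memory still has to be passed on the command line, since it cannot be changed after the driver JVM starts):

      import org.apache.spark.sql.SparkSession

      // Hypothetical standalone setup mirroring the spark-shell flags above
      val spark = SparkSession.builder()
        .appName("sort-repro")
        .config("spark.driver.maxResultSize", "100MB")
        .getOrCreate()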
      

      Even if we increase spark.driver.maxResultSize, the executors still get killed for our use case. Interestingly, when using the RDD API directly the problem does not occur. It looks like there is a bug in the DataFrame sort: is it shuffling too much data to the driver?
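
      One way to test this hypothesis (a diagnostic sketch, assuming the same spark-shell session as the snippets above) is to print the physical plan; limit(1) is close to what take(1) executes, so this shows how the sorted result is collected:

      // Print the physical plan for the sorted-and-limited query
      test.toDF().as[Nested].sort("a").limit(1).explain()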

      Note: this is a small example, and I reduced spark.driver.maxResultSize so that it fails quickly. In our real application I tried setting it to 8 GB, but as mentioned above the job was still killed.
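
      A possible interim workaround (a sketch based on the observation above, not a verified fix) is to drop to the RDD API just for the sort; note that .rdd deserializes every row back into Nested objects, so it adds conversion cost:

      // Hypothetical workaround: sort through the RDD API, which did not
      // hit spark.driver.maxResultSize in the test above
      val ds = test.toDF().as[Nested]
      val first = ds.rdd.sortBy(_.a).take(1)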

       


              People

              • Assignee: Unassigned
              • Reporter: George George (george21)
              • Votes: 0
              • Watchers: 5
