Spark / SPARK-27039

toPandas with Arrow swallows maxResultSize errors


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels:
      None

      Description

      I am running the following simple `toPandas` call with spark.driver.maxResultSize set to 1m:

      import pyspark.sql.functions as F

      # Assumes `spark` is an active SparkSession (for example, from the pyspark shell).
      long_str = "this is a long string that should make the resulting dataframe too large for maxResult which is 1m"
      df = spark.range(1000 * 1000)
      df_pd = df.withColumn("test", F.lit(long_str)).toPandas()
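
For anyone trying to reproduce this, here is a minimal sketch of how the session could be configured. The two configuration keys are the ones named in this report; the `local[2]` master, the app name, and the helper name `build_repro_session` are assumptions made for illustration, not part of the original setup.

```python
# Settings described in this report. Values are strings, as Spark expects.
REPRO_CONF = {
    "spark.driver.maxResultSize": "1m",
    "spark.sql.execution.arrow.enabled": "true",
}


def build_repro_session(conf=REPRO_CONF, master="local[2]"):
    """Create (or reuse) a SparkSession carrying the repro settings.

    `master` and the app name are illustrative assumptions.
    """
    # Imported lazily so this module can be loaded without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(master).appName("SPARK-27039-repro")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (requires pyspark):
#   spark = build_repro_session()
```

Note that spark.driver.maxResultSize is read when the driver starts, so it probably needs to be set before any SparkContext exists; `getOrCreate()` on an already running session would not be expected to change it.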
      

       
      With spark.sql.execution.arrow.enabled set to true, the toPandas call above returns an empty pandas DataFrame without raising any error:

      df_pd.info()
      
      # <class 'pandas.core.frame.DataFrame'>
      # Index: 0 entries
      # Data columns (total 2 columns):
      # id      0 non-null object
      # test    0 non-null object
      # dtypes: object(2)
      # memory usage: 0.0+ bytes
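
Until the underlying issue is fixed, a caller could guard against the silent empty result by comparing the collected row count with the expected one. The sketch below is only an illustration of that idea: `assert_complete` is a hypothetical helper, not a Spark or pandas API, and obtaining the expected count with `df.count()` runs an additional Spark job.

```python
def assert_complete(expected_rows, collected):
    """Return `collected` if it has `expected_rows` rows, else raise.

    `collected` can be any object supporting len(), such as a pandas DataFrame.
    """
    actual_rows = len(collected)
    if actual_rows != expected_rows:
        raise RuntimeError(
            "collected %d rows but expected %d; check the driver log for "
            "errors such as spark.driver.maxResultSize being exceeded"
            % (actual_rows, expected_rows)
        )
    return collected


# Usage with the DataFrame from this report (requires a running SparkSession):
#   df_pd = assert_complete(df.count(), df.toPandas())
```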
      

      The driver stderr does have an error, and so does the Spark UI:

      ERROR TaskSetManager: Total size of serialized results of 1 tasks (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
      ERROR TaskSetManager: Total size of serialized results of 2 tasks (105.7 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
      
      Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1 tasks (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
       at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039)
       at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027)
       at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026)
       at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
       at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2026)
       at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
       at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
       at scala.Option.foreach(Option.scala:257)
       at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2260)
       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2209)
       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2198)
       at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
       at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
       at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
       at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3313)
       at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3282)
       at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435)
       at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
       at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
       at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436)
       at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432)
       at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862)
      

      With spark.sql.execution.arrow.enabled set to false, the Python call to toPandas does fail as expected.
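
Since the non-Arrow path surfaces the error, one possible stopgap is to switch Arrow off only around the `toPandas` call. This is a sketch under the assumption that `spark` is an active SparkSession; `arrow_disabled` is a hypothetical helper written for this example, and it gives up the Arrow speedup for the wrapped call.

```python
from contextlib import contextmanager

ARROW_KEY = "spark.sql.execution.arrow.enabled"


@contextmanager
def arrow_disabled(spark):
    """Temporarily set the Arrow flag to "false", restoring it afterwards."""
    # Treat an unset flag as "false" when remembering the previous value.
    previous = spark.conf.get(ARROW_KEY, "false")
    spark.conf.set(ARROW_KEY, "false")
    try:
        yield spark
    finally:
        # Restore the earlier setting even if toPandas raised.
        spark.conf.set(ARROW_KEY, previous)


# Usage (requires a running SparkSession):
#   with arrow_disabled(spark):
#       df_pd = df.toPandas()  # per this report, fails loudly on maxResultSize
```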

              People

              • Assignee:
                Unassigned
                Reporter:
                peay peay