Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Duplicate
- Affects Version/s: 2.4.0
- Fix Version/s: None
- Component/s: None
Description
I am running the following simple `toPandas` call with `spark.driver.maxResultSize` set to 1 MB:

```python
import pyspark.sql.functions as F

df = spark.range(1000 * 1000)
df_pd = df.withColumn(
    "test",
    F.lit("this is a long string that should make the resulting dataframe too large for maxResult which is 1m")
).toPandas()
```
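For reference, the session configuration implied above looks roughly like this. Only the two config keys come from the report; the builder pattern and exact values are a plausible reconstruction, and note that `spark.driver.maxResultSize` must be set before the driver starts:

```python
from pyspark.sql import SparkSession

# Reconstruction of the reported setup; the config keys are standard
# Spark 2.4 settings, the rest is illustrative.
spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "1m")           # 1 MB result cap
    .config("spark.sql.execution.arrow.enabled", "true")  # Arrow-backed toPandas
    .getOrCreate()
)
```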
With `spark.sql.execution.arrow.enabled` set to true, this returns an empty Pandas dataframe without any error:

```python
df_pd.info()
# <class 'pandas.core.frame.DataFrame'>
# Index: 0 entries
# Data columns (total 2 columns):
# id      0 non-null object
# test    0 non-null object
# dtypes: object(2)
# memory usage: 0.0+ bytes
```
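Until this is fixed, a caller can only detect the problem with an explicit guard. A minimal workaround sketch (not from the report, assuming the `df` defined above):

```python
# Workaround sketch: fail loudly if toPandas() silently returned an empty
# frame for a Dataset that is known to be non-empty.
df_pd = df.toPandas()
if df_pd.empty and df.head(1):
    raise RuntimeError(
        "toPandas() returned an empty DataFrame; the result likely exceeded "
        "spark.driver.maxResultSize"
    )
```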
The driver stderr does contain an error, as does the Spark UI:

```
ERROR TaskSetManager: Total size of serialized results of 1 tasks (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
ERROR TaskSetManager: Total size of serialized results of 2 tasks (105.7 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1 tasks (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2026)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2260)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2209)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2198)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3313)
	at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3282)
	at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435)
	at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
	at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436)
	at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432)
	at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862)
```
With `spark.sql.execution.arrow.enabled` set to false, the Python call to `toPandas` does fail as expected.
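A minimal sketch of that non-Arrow path, assuming the `spark` session and `df` from above (the exception type seen on the Python side is an assumption; in practice it is a Py4J error wrapping the SparkException):

```python
# With Arrow disabled, the same collect is expected to raise instead of
# returning an empty frame.
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
try:
    df.toPandas()
except Exception as e:
    print(type(e).__name__)  # e.g. Py4JJavaError wrapping the SparkException
```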
Issue Links
- duplicates SPARK-28881: toPandas with Arrow should not return a DataFrame when the result size exceeds `spark.driver.maxResultSize` (Resolved)