Details
Sub-task
Status: In Progress
Major
Resolution: Unresolved
3.0.2, 3.1.1
None
None
Description
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
from pyspark.testing.sqlutils import ExamplePoint
import pandas as pd
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
df = spark.createDataFrame(pdf)
df.toPandas()
With `spark.sql.execution.arrow.enabled` = false, the above snippet runs without warnings.
With `spark.sql.execution.arrow.enabled` = true, the snippet still runs but emits warnings: ExamplePointUDT is an unsupported type in conversion to Arrow, so the Arrow optimization is effectively turned off and PySpark falls back to the non-Arrow path.
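Because the fallback only shows up as a `UserWarning`, it is easy to miss in scripts. The following is a minimal sketch of one way to detect it with the standard `warnings` module. The helper name `call_and_collect_fallbacks`, the marker substring (copied from the warning text in this report), and the stand-in function `fake_to_pandas` are all illustrative assumptions, not part of PySpark, and the marker may differ across Spark versions.

```python
import warnings

# Substring taken from the warning text quoted in this report. Treat it as an
# assumption: the exact wording may change between Spark versions.
FALLBACK_MARKER = "attempted Arrow optimization"


def call_and_collect_fallbacks(fn, *args, **kwargs):
    """Run fn and return (result, list of Arrow-fallback warning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones Python would normally show only once.
        warnings.simplefilter("always")
        result = fn(*args, **kwargs)
    messages = [
        str(w.message)
        for w in caught
        if issubclass(w.category, UserWarning) and FALLBACK_MARKER in str(w.message)
    ]
    return result, messages


def fake_to_pandas():
    # Hypothetical stand-in for df.toPandas(); it only mimics the warning so
    # the sketch runs without a Spark session.
    warnings.warn(
        "toPandas attempted Arrow optimization because "
        "'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, "
        "failed by the reason below: Unsupported type in conversion to Arrow: "
        "ExamplePointUDT",
        UserWarning,
    )
    return "pandas-frame"


result, fallbacks = call_and_collect_fallbacks(fake_to_pandas)
print(result, len(fallbacks))
```

With a live session, the same helper could presumably wrap the real calls, for example `call_and_collect_fallbacks(df.toPandas)`. Note that warnings raised inside the wrapped call are captured rather than printed, so any unrelated ones are dropped by this sketch.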
Detailed steps to reproduce:
$ bin/pyspark
Python 3.8.8 (default, Feb 24 2021, 13:46:16)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
      /_/

Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
Spark context Web UI available at http://172.30.0.226:4040
Spark context available as 'sc' (master = local[*], app id = local-1615994008526).
SparkSession available as 'spark'.
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
21/03/17 23:13:31 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
>>> from pyspark.testing.sqlutils import ExamplePoint
>>> import pandas as pd
>>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
>>> df = spark.createDataFrame(pdf)
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
>>>
>>> df.show()
+----------+
|     point|
+----------+
|(0.0, 0.0)|
|(0.0, 0.0)|
+----------+

>>> df.schema
StructType(List(StructField(point,ExamplePointUDT,true)))
>>> df.toPandas()
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Unsupported type in conversion to Arrow: ExamplePointUDT
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
       point
0  (0.0,0.0)
1  (0.0,0.0)