[SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled - ASF JIRA

Details

Type: Sub-task
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0.2, 3.1.1
Fix Version/s: None
Component/s: PySpark
Labels: None

Description

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
from pyspark.testing.sqlutils import ExamplePoint
import pandas as pd
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
df = spark.createDataFrame(pdf)
df.toPandas()

With `spark.sql.execution.arrow.enabled` set to false, the above snippet runs without warnings.

With `spark.sql.execution.arrow.enabled` set to true, the snippet still runs but emits warnings. The UDT is an unsupported type in the Arrow conversion, so both `createDataFrame` and `toPandas` fall back to the non-Arrow path and the Arrow optimization is effectively turned off.
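Because the fallback only surfaces as a `UserWarning`, it is easy to miss. The following is a minimal sketch of one way to detect it programmatically. The helper names `run_and_collect_arrow_fallbacks` and `reproduce` are hypothetical, the marker string is copied from the warning text in the transcript in this report, and `reproduce` assumes an active `SparkSession` is passed in.

```python
import warnings

# Substring copied from the warning text shown in the transcript of this report.
FALLBACK_MARKER = "attempted Arrow optimization"


def run_and_collect_arrow_fallbacks(action):
    """Run action() and return (result, Arrow-fallback warning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones already shown earlier in the session.
        warnings.simplefilter("always")
        result = action()
    messages = [str(w.message) for w in caught if FALLBACK_MARKER in str(w.message)]
    return result, messages


def reproduce(spark):
    """Hypothetical driver: needs an active SparkSession plus pandas and pyspark."""
    import pandas as pd
    from pyspark.testing.sqlutils import ExamplePoint

    # Non-deprecated name of the config, as suggested by the SQLConf warning.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    pdf = pd.DataFrame({"point": pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})

    df, create_msgs = run_and_collect_arrow_fallbacks(lambda: spark.createDataFrame(pdf))
    _, to_pandas_msgs = run_and_collect_arrow_fallbacks(df.toPandas)
    # Non-empty lists mean the corresponding call fell back to the non-Arrow path.
    return create_msgs, to_pandas_msgs
```

Alternatively, setting `spark.sql.execution.arrow.pyspark.fallback.enabled` (the config named in the warnings) to false should turn the silent fallback into an error, which is usually easier to notice in tests.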

Detailed steps to reproduce:

$ bin/pyspark
Python 3.8.8 (default, Feb 24 2021, 13:46:16)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
      /_/

Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
Spark context Web UI available at http://172.30.0.226:4040
Spark context available as 'sc' (master = local[*], app id = local-1615994008526).
SparkSession available as 'spark'.
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
21/03/17 23:13:31 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
>>> from pyspark.testing.sqlutils  import ExamplePoint
>>> import pandas as pd
>>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
>>> df = spark.createDataFrame(pdf)
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
>>>
>>> df.show()
+----------+
|     point|
+----------+
|(0.0, 0.0)|
|(0.0, 0.0)|
+----------+

>>> df.schema
StructType(List(StructField(point,ExamplePointUDT,true)))
>>> df.toPandas()
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Unsupported type in conversion to Arrow: ExamplePointUDT
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
       point
0  (0.0,0.0)
1  (0.0,0.0)
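
The first warning above shows that Arrow cannot infer a type for a plain Python `ExamplePoint` object. The UDT's serialized storage form (for `ExamplePointUDT`, a list of two doubles) is something Arrow can represent, so one plausible shape for UDT support is to serialize before the Arrow conversion and deserialize afterwards. The sketch below illustrates only that round trip, using hypothetical stand-in classes (`Point`, `PointUDT`) and helpers (`to_storage`, `from_storage`) instead of PySpark itself. It is not claimed to be the approach taken in the linked pull requests.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Point:
    """Hypothetical stand-in for pyspark.testing.sqlutils.ExamplePoint."""
    x: float
    y: float


class PointUDT:
    """Hypothetical stand-in for ExamplePointUDT: maps Point <-> [x, y]."""

    def serialize(self, obj: Point) -> List[float]:
        return [float(obj.x), float(obj.y)]

    def deserialize(self, datum: List[float]) -> Point:
        return Point(datum[0], datum[1])


def to_storage(column: List[Optional[Point]], udt: PointUDT):
    """Convert user objects to the storage form, preserving nulls."""
    return [None if v is None else udt.serialize(v) for v in column]


def from_storage(column, udt: PointUDT):
    """Convert the storage form back to user objects, preserving nulls."""
    return [None if v is None else udt.deserialize(v) for v in column]


points = [Point(1, 1), Point(2, 2), None]
stored = to_storage(points, PointUDT())       # lists of doubles, Arrow-friendly
restored = from_storage(stored, PointUDT())   # user objects again
```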

Issue Links

links to

[Github] Pull Request #32026 (sadhen)

[Github] Pull Request #32321 (sadhen)

People

Assignee: Unassigned

Reporter: Darcy Shen

Votes: 0

Watchers: 3

Dates

Created: 17/Mar/21 06:42

Updated: 25/Apr/21 03:38