[SPARK-40307] Introduce Arrow Python UDFs - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Umbrella
Status: Resolved
Priority: Major
Resolution: Done
Affects Version/s: 3.5.0
Fix Version/s: 3.5.0
Component/s: Connect, PySpark
Labels:
None

Description

Python user-defined function (UDF) enables users to run arbitrary code against PySpark columns. It uses Pickle for (de)serialization and executes row by row.

One major performance bottleneck of Python UDFs is (de)serialization, that is, the data interchanging between the worker JVM and the spawned Python subprocess which actually executes the UDF. We should seek an alternative to handle the (de)serialization: Arrow, which is used in the (de)serialization of Pandas UDF already.

There should be two ways to enable/disable the Arrow optimization for Python UDFs:

the Spark configuration `spark.sql.execution.pythonUDF.arrow.enabled`, disabled by default.
the `useArrow` parameter of the `udf` function, None by default.

The Spark configuration takes effect only when `useArrow` is None. Otherwise, `useArrow` decides whether a specific user-defined function is optimized by Arrow or not.

The reason why we introduce these two ways is to provide both a convenient, per-Spark-session control and a finer-grained, per-UDF control of the Arrow optimization for Python UDFs.

Attachments

Issue Links

relates to

SPARK-43543 Standardize Nested Complex DataTypes Support

Resolved

links to

[Github] Pull Request #39384 (xinrong-meng)

Sub-Tasks

1.	Block Arrow Python UDFs	Resolved	Xinrong Meng
2.	Arrow Python UDFs in Spark Connect	Resolved	Xinrong Meng
3.	Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow Python UDFs	Resolved	Xinrong Meng
4.	Support registration of an Arrow Python UDF	Resolved	Xinrong Meng
5.	Non-atomic data type support in Arrow Python UDF	Resolved	Xinrong Meng
6.	Improve ArrayType input support in Arrow Python UDF	Resolved	Xinrong Meng
7.	Explicit Arrow casting for mismatched return type in Arrow Python UDF	Resolved	Xinrong Meng
8.	Import SparkSession in Python UDF only when useArrow is None	Resolved	Xinrong Meng
9.	Arrow Python UDF Use Guide	Resolved	Xinrong Meng
10.	Improve tests and documentation for Arrow Python UDF	Resolved	Xinrong Meng
11.	Arrow python UDFS didn't support UDT as output type	Open	Unassigned

Activity

People

Assignee:: Xinrong Meng

Reporter:: Xinrong Meng

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 01/Sep/22 21:32

Updated:: 25/Aug/23 02:48

Resolved:: 25/Aug/23 02:48