[SPARK-44856] Improve Python UDTF arrow serializer performance - ASF JIRA

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.5.0, 4.0.0
Fix Version/s: None
Component/s: PySpark
Labels: None

Description

Currently, the Arrow serializer for Python UDTFs carries a lot of overhead. Most of it comes from converting Arrow batches into pandas Series and converting the UDTF's results back into a pandas DataFrame.

We should try converting Python objects directly to Arrow, and vice versa, to avoid the expensive pandas conversion, similar to this converter: https://github.com/apache/spark/blob/be04ac1ace91f6da34b08a1510e41d3ab6f0377b/python/pyspark/sql/connect/conversion.py#L56

People

Assignee: Michael Zhang

Reporter: Allison Wang

Votes: 0

Watchers: 2

Dates

Created: 17/Aug/23 19:09

Updated: 22/Aug/23 01:39