Details
-
Sub-task
-
Status: Open
-
Major
-
Resolution: Unresolved
-
3.5.0, 4.0.0
-
None
-
None
Description
Currently, there is a lot of overhead in the arrow serializer for Python UDTFs. The overhead is largely from converting arrow batches into pandas series and converting UDTF's results back to a pandas dataframe.
We should try directly converting Python object into arrow and vice versa to avoid the expensive pandas conversion. Similar to this converter: https://github.com/apache/spark/blob/be04ac1ace91f6da34b08a1510e41d3ab6f0377b/python/pyspark/sql/connect/conversion.py#L56