  1. Spark
  2. SPARK-43797 Python User-defined Table Functions
  3. SPARK-44856

Improve Python UDTF arrow serializer performance


Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0, 4.0.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      The Arrow serializer for Python UDTFs currently carries significant overhead. Most of it comes from converting Arrow batches into pandas Series and then converting the UDTF's results back into a pandas DataFrame.

      We should try converting Python objects directly to Arrow, and Arrow back to Python objects, to avoid the expensive pandas round trip. This would be similar to the following converter: https://github.com/apache/spark/blob/be04ac1ace91f6da34b08a1510e41d3ab6f0377b/python/pyspark/sql/connect/conversion.py#L56

       


          People

            m.zhang Michael Zhang
            allisonwang-db Allison Wang