SPARK-40307

Introduce Arrow Python UDFs

Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 3.5.0
    • Fix Version/s: 3.5.0
    • Component/s: Connect, PySpark
    • Labels: None

    Description

      A Python user-defined function (UDF) lets users run arbitrary Python code against PySpark columns. It uses Pickle for (de)serialization and executes row by row.

      One major performance bottleneck of Python UDFs is (de)serialization, that is, the data interchange between the worker JVM and the spawned Python subprocess that actually executes the UDF. We should adopt an alternative for this (de)serialization: Arrow, which Pandas UDFs already use.

      There should be two ways to enable/disable the Arrow optimization for Python UDFs:

      • the Spark configuration `spark.sql.execution.pythonUDF.arrow.enabled`, disabled by default.
      • the `useArrow` parameter of the `udf` function, None by default.

      The Spark configuration takes effect only when `useArrow` is None. Otherwise, `useArrow` decides whether a specific user-defined function is optimized by Arrow or not.
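
      The precedence rule above can be sketched in plain Python. This is only an illustration of the described behavior, not Spark's actual implementation; the function name `resolve_use_arrow` and its parameters are hypothetical.

```python
from typing import Optional


def resolve_use_arrow(use_arrow: Optional[bool], session_conf_enabled: bool = False) -> bool:
    """Decide whether one UDF should use the Arrow optimization.

    Hypothetical helper that models the rule in the description:
    - use_arrow: the per-UDF `useArrow` argument (None by default).
    - session_conf_enabled: the value of
      `spark.sql.execution.pythonUDF.arrow.enabled` (disabled by default).
    """
    if use_arrow is None:
        # No per-UDF choice was made, so the session configuration decides.
        return session_conf_enabled
    # An explicit per-UDF choice always wins over the session configuration.
    return use_arrow
```

      Under this model, a UDF created with `useArrow=False` stays on the Pickle path even when the session configuration is enabled.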

      Offering both switches gives users a convenient per-Spark-session control as well as a finer-grained per-UDF control over the Arrow optimization for Python UDFs.
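
      A minimal usage sketch of the two controls, assuming PySpark 3.5.0 or later with `pyarrow` and `pandas` installed. The function name `demo`, the UDF names, and the local session settings are illustrative assumptions; only the configuration key and the `useArrow` parameter come from the description.

```python
def demo():
    # Imports are inside the function so that defining it does not require Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf

    # Illustrative local session; any existing SparkSession would work as well.
    spark = SparkSession.builder.master("local[1]").getOrCreate()

    # Per-session control: applies to every UDF whose useArrow is None.
    spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")

    # useArrow is left at its default (None), so the session configuration decides.
    @udf(returnType="long")
    def plus_one(x):
        return x + 1

    # Per-UDF control: an explicit value overrides the session configuration.
    @udf(returnType="long", useArrow=False)
    def plus_two(x):
        return x + 2

    df = spark.range(3)  # single column "id" holding 0, 1, 2
    rows = df.select(plus_one("id").alias("a"), plus_two("id").alias("b")).collect()
    spark.stop()
    return rows
```

      Calling `demo()` runs both UDFs over the same column: `plus_one` follows the session setting, while `plus_two` opts out regardless of it.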

          People

            Assignee: Xinrong Meng
            Reporter: Xinrong Meng