Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels:
      None

      Description

      Using Arrow, Python UDFs can be evaluated in vectorized form by passing the column data as a pandas.Series. This offers a performance gain by computing the returned column data in one operation, instead of iterating over each row to calculate a single element and append it to a list, as is currently done. The existing Python UDF API, which already specifies the return type, can be used to implement this. Since not all functions can be vectorized, there would need to be a way to enable this optimization, such as a SQLConf.

      This is intended as a preliminary step for the existing SPIP "Vectorized UDFs in Python" (SPARK-21190), and could serve as a basis for whatever expanded API is decided upon there.
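
The difference between the two evaluation styles described above can be sketched with pandas alone. This is a minimal illustration, not the code from the pull request: it does not use Spark or Arrow, and the helper names (`plus_one_row`, `plus_one_vectorized`, `eval_row_at_a_time`, `eval_vectorized`) are hypothetical, chosen only for this example.

```python
import pandas as pd


def plus_one_row(x):
    # Row-at-a-time UDF body: receives a single scalar value.
    return x + 1


def plus_one_vectorized(col: pd.Series) -> pd.Series:
    # Vectorized UDF body: receives the whole column batch as a pandas.Series.
    return col + 1


def eval_row_at_a_time(func, column: pd.Series) -> pd.Series:
    # Mimics the per-row approach: one call per element, appended to a list.
    results = []
    for value in column:
        results.append(func(value))
    return pd.Series(results)


def eval_vectorized(func, column: pd.Series) -> pd.Series:
    # Computes the returned column in a single call on the whole batch.
    return func(column)


column = pd.Series([1, 2, 3])
row_result = eval_row_at_a_time(plus_one_row, column)
vec_result = eval_vectorized(plus_one_vectorized, column)
```

Both paths produce the same column here. The vectorized path makes one Python call per batch instead of one per row, which is where the expected performance gain would come from. A function that cannot operate on a whole Series at once would still need the row-at-a-time path, hence the need for an opt-in.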

        Issue Links

          Activity

          bryanc Bryan Cutler added a comment -

          I'll submit the work I've done so far as a WIP PR, and I'm open to discussion on using this as a first step toward an expanded API for vectorized UDFs in SPARK-21190.

          apachespark Apache Spark added a comment -

          User 'BryanCutler' has created a pull request for this issue:
          https://github.com/apache/spark/pull/18659

          bryanc Bryan Cutler added a comment -

          This has been merged as SPARK-21190


            People

            • Assignee:
              Unassigned
              Reporter:
              bryanc Bryan Cutler
            • Votes:
              0
              Watchers:
              4

              Dates

              • Created:
                Updated:
                Resolved:

                Development