Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-22216 Improving PySpark/Pandas interoperability
  3. SPARK-25601

Register Grouped aggregate UDF Vectorized UDFs for SQL Statement

    XMLWordPrintableJSON

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.0, 3.0.0
    • Component/s: PySpark, SQL
    • Labels:
      None
    • Target Version/s:

      Description

      Capable of registering grouped aggregate UDsF and then use it in SQL statement.

      For example,

      from pyspark.sql.functions import pandas_udf, PandasUDFType
      
      @pandas_udf("integer", PandasUDFType.GROUPED_AGG)  # doctest: +SKIP
      def sum_udf(v):
          return v.sum()
      
      spark.udf.register("sum_udf", sum_udf)  # doctest: +SKIP
      q = "SELECT sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2"
      spark.sql(q).show()
      
      +-----------+
      |sum_udf(v1)|
      +-----------+
      |          1|
      |          5|
      +-----------+
      

        Attachments

          Activity

            People

            • Assignee:
              hyukjin.kwon Hyukjin Kwon
              Reporter:
              hyukjin.kwon Hyukjin Kwon
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: