Spark / SPARK-22216 Improving PySpark/Pandas interoperability / SPARK-25601

Register grouped aggregate vectorized UDFs for SQL statements

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.0, 3.0.0
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      Make it possible to register grouped aggregate UDFs and then use them in SQL statements.

      For example,

      from pyspark.sql.functions import pandas_udf, PandasUDFType

      # Grouped aggregate pandas UDF: takes one pandas Series per group, returns a scalar.
      @pandas_udf("integer", PandasUDFType.GROUPED_AGG)
      def sum_udf(v):
          return v.sum()

      # Register the UDF by name so SQL can call it (assumes an existing SparkSession `spark`).
      spark.udf.register("sum_udf", sum_udf)
      q = "SELECT sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2"
      spark.sql(q).show()
      
      +-----------+
      |sum_udf(v1)|
      +-----------+
      |          1|
      |          5|
      +-----------+
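
To make the expected result easier to check without a Spark cluster, the same aggregation can be sketched in plain pandas. This is only an illustration of what the registered UDF computes per group, not how Spark executes it; the DataFrame name `tbl` and the variable `result` are assumptions chosen to mirror the query.

```python
import pandas as pd

# Same rows as the VALUES clause in the SQL query.
tbl = pd.DataFrame({"v1": [3, 2, 1], "v2": [0, 0, 1]})


def sum_udf(v: pd.Series) -> int:
    # Mirrors the UDF body: one pandas Series per group in, one scalar out.
    return int(v.sum())


# Equivalent of GROUP BY v2 with sum_udf applied to each group's v1 values.
result = tbl.groupby("v2")["v1"].apply(sum_udf)
print(result)
```

The group with v2 = 0 sums to 5 and the group with v2 = 1 sums to 1, matching the two rows shown in the issue. Row order in the Spark output is not guaranteed without an ORDER BY.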
      

          People

            Assignee: Hyukjin Kwon (hyukjin.kwon)
            Reporter: Hyukjin Kwon (hyukjin.kwon)