
[SPARK-25599] Stateful aggregation in PySpark


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      Hi!

      From PySpark, I am trying to define a custom aggregator that accumulates state. Is this possible in Spark 2.3?

      AFAIK, it is now possible to define a custom UDAF in PySpark since Spark 2.3 (cf. "How to define and use a User-Defined Aggregate Function in Spark SQL?"), by calling pandas_udf with the PandasUDFType.GROUPED_AGG function type.

      However, given that it just takes a function as a parameter, I don't think it is possible to carry state around during the aggregation with this approach.
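
      For illustration, this is the shape I am referring to (the SparkSession, data, and names below are made up): the UDF receives all of a group's values at once as a pandas Series and must return a single scalar, so there is no buffer object carried between calls.

      from pyspark.sql import SparkSession
      from pyspark.sql.functions import pandas_udf, PandasUDFType

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

      @pandas_udf("double", PandasUDFType.GROUPED_AGG)
      def mean_udf(v):
          # v holds every value of the group as one pandas Series;
          # the function returns a single scalar per group.
          return v.mean()

      df.groupBy("id").agg(mean_udf(df["v"])).show()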

      From Scala, I see it is possible to have stateful aggregation by extending either UserDefinedAggregateFunction or org.apache.spark.sql.expressions.Aggregator, but is there something similar I can do on the Python side only?
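
      The closest Python-only workaround I can see is to drop down to the RDD API, where aggregateByKey threads an explicit state value through the aggregation, much like the mutable buffer of a UserDefinedAggregateFunction (a rough sketch with made-up data and helper names):

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

      # The (sum, count) tuple is explicit per-key state, analogous to the
      # mutable buffer of a UserDefinedAggregateFunction.
      zero = (0.0, 0)

      def seq_op(acc, v):    # fold one value into the partition-local state
          return (acc[0] + v, acc[1] + 1)

      def comb_op(a, b):     # merge partial states across partitions
          return (a[0] + b[0], a[1] + b[1])

      means = (df.rdd
                 .map(lambda row: (row["id"], row["v"]))
                 .aggregateByKey(zero, seq_op, comb_op)
                 .mapValues(lambda acc: acc[0] / acc[1]))
      print(means.collect())  # e.g. [(1, 1.5), (2, 3.0)]

      This works, but it leaves the DataFrame world entirely, so a first-class stateful aggregation API would still be valuable.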

      If not, is this planned for a future release?

      Thanks!

    Attachments

    Issue Links

    Activity
    People

      Assignee: Unassigned
      Reporter: Vincent Grosbois (vincent-gg)
      Votes: 0
      Watchers: 4

    Dates

      Created:
      Updated:
      Resolved: