
[SPARK-25599] Stateful aggregation in PySpark


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      Hi!

      From PySpark, I am trying to define a custom aggregator that accumulates state. Is this possible in Spark 2.3?

      AFAIK, it is now possible to define a custom UDAF in PySpark since Spark 2.3 (cf. "How to define and use a User-Defined Aggregate Function in Spark SQL?"), by calling pandas_udf with the PandasUDFType.GROUPED_AGG function type.

      However, given that it just takes a function as a parameter, I don't think it is possible to carry state around during the aggregation with this approach.
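
      For illustration, this is the shape I am referring to (the SparkSession, data, and names below are made up): the UDF receives all of a group's values at once as a pandas Series and must return a single scalar, so there is no buffer object carried between calls.

      from pyspark.sql import SparkSession
      from pyspark.sql.functions import pandas_udf, PandasUDFType

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

      @pandas_udf("double", PandasUDFType.GROUPED_AGG)
      def mean_udf(v):
          # v holds every value of the group as one pandas Series;
          # the function returns a single scalar per group.
          return v.mean()

      df.groupBy("id").agg(mean_udf(df["v"])).show()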

      From Scala, I see it is possible to have stateful aggregation by extending either UserDefinedAggregateFunction or org.apache.spark.sql.expressions.Aggregator, but is there something similar I can do on the Python side only?
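
      The closest Python-only workaround I can see is to drop down to the RDD API, where aggregateByKey threads an explicit state value through the aggregation, much like the mutable buffer of a UserDefinedAggregateFunction (a rough sketch with made-up data and helper names):

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

      # The (sum, count) tuple is explicit per-key state, analogous to the
      # mutable buffer of a UserDefinedAggregateFunction.
      zero = (0.0, 0)

      def seq_op(acc, v):    # fold one value into the partition-local state
          return (acc[0] + v, acc[1] + 1)

      def comb_op(a, b):     # merge partial states across partitions
          return (a[0] + b[0], a[1] + b[1])

      means = (df.rdd
                 .map(lambda row: (row["id"], row["v"]))
                 .aggregateByKey(zero, seq_op, comb_op)
                 .mapValues(lambda acc: acc[0] / acc[1]))
      print(means.collect())  # e.g. [(1, 1.5), (2, 3.0)]

      This works, but it leaves the DataFrame world entirely, so a first-class stateful aggregation API would still be valuable.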

      If not, is this planned for a future release?

      Thanks!

    Attachments

    Issue Links

    Activity
    People

      Assignee: Unassigned
      Reporter: Vincent Grosbois (vincent-gg)
      Votes: 0
      Watchers: 4

    Dates

      Created:
      Updated:
      Resolved: