[SPARK-27296] Efficient User Defined Aggregators - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.3, 2.4.0, 3.0.0
Fix Version/s: 3.0.0
Component/s: Spark Core, SQL, Structured Streaming
Labels:
- performance
- usability

Target Version/s: 3.0.0
Flags: Important

Description

Spark's UDAFs appear to be serializing and de-serializing to/from the MutableAggregationBuffer for each row. This gist shows a small reproducing UDAF and a spark shell session:

https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f

The UDAF and its companion UDT are designed to count the number of times that ser/de is invoked for the aggregator. The spark shell session demonstrates that ser/de is executed on every row of the data frame.

Note, Spark's pre-defined aggregators do not have this problem, as they are based on an internal aggregating trait that does the correct thing and only calls ser/de at points such as partition boundaries, presenting final results, etc.

This is a major problem for UDAFs, as it means that every UDAF is doing a massive amount of unnecessary work per row, including but not limited to Row object allocations. For a more realistic UDAF with its own non-trivial internal structure, the overhead is correspondingly worse.
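
To make the difference in ser/de counts concrete, here is a minimal sketch in plain Scala. It is a simplified model and not Spark's implementation: `SerDeCountSketch`, `SumState`, and every method name are hypothetical, and the counters only stand in for the counting UDT described above. One path round-trips the aggregator state through a serialized buffer on every row, as reported for UDAFs; the other keeps the live object and serializes once at a boundary, as described for the pre-defined aggregators.

```scala
// Simplified model only; all names are hypothetical and nothing here is Spark API.
object SerDeCountSketch {
  // Counters standing in for the counting UDT from the issue description.
  var serCount = 0
  var deCount = 0

  // Toy aggregator state: a running sum.
  final case class SumState(total: Long)

  def serialize(s: SumState): Array[Byte] = {
    serCount += 1
    BigInt(s.total).toByteArray
  }

  def deserialize(bytes: Array[Byte]): SumState = {
    deCount += 1
    SumState(BigInt(bytes).toLong)
  }

  def resetCounters(): Unit = { serCount = 0; deCount = 0 }

  // Reported UDAF pattern: the buffer holds serialized state, so every row
  // pays for one deserialize and one serialize.
  def aggregatePerRow(rows: Seq[Long]): Long = {
    var buffer = serialize(SumState(0L))
    for (r <- rows) {
      val state = deserialize(buffer)
      buffer = serialize(SumState(state.total + r))
    }
    deserialize(buffer).total
  }

  // Pattern described for pre-defined aggregators: update the live object
  // per row and serialize only once, at a boundary such as end of partition.
  def aggregateAtBoundary(rows: Seq[Long]): Long = {
    var state = SumState(0L)
    for (r <- rows) state = SumState(state.total + r)
    val buffer = serialize(state)
    deserialize(buffer).total
  }

  def main(args: Array[String]): Unit = {
    val rows = (1L to 1000L).toSeq

    resetCounters()
    val perRow = aggregatePerRow(rows)
    println(s"per-row:     result=$perRow ser=$serCount de=$deCount")

    resetCounters()
    val atBoundary = aggregateAtBoundary(rows)
    println(s"at boundary: result=$atBoundary ser=$serCount de=$deCount")
  }
}
```

Both paths return the same sum, but in the per-row path the ser/de counts grow linearly with the number of rows, while in the boundary path they stay constant regardless of input size.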

Issue Links

is related to

SPARK-30423 Deprecate UserDefinedAggregateFunction

Resolved

links to

GitHub Pull Request #25024

People

Assignee: Erik Erlandson

Reporter: Erik Erlandson

Votes: 0

Watchers: 11

Dates

Created: 27/Mar/19 23:12

Updated: 15/Jun/20 21:24

Resolved: 12/Jan/20 07:38