[SPARK-17949] Introduce a JVM object based aggregate operator - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.2.0
Component/s: SQL
Labels: releasenotes
Target Version/s: 2.2.0

Description

The new Tungsten execution engine offers robust memory management and high speed for simple data types. It does, however, have the following limitations:

- For user-defined aggregates (Hive UDAFs, Dataset typed operators), fitting the aggregation state into the Tungsten internal format is fairly expensive.
- For aggregate functions that require complex intermediate data structures, Unsafe (operating on raw bytes) is not a good programming abstraction due to the lack of structs.

The idea here is to introduce a JVM object based hash aggregate operator that can support the aforementioned use cases. This operator should, however, limit its memory usage to avoid putting too much pressure on GC, e.g. by falling back to sort-based aggregation as soon as the number of objects exceeds a very low threshold.
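
The hash-then-fallback strategy described above can be illustrated with a minimal, language-neutral sketch. This is not Spark's implementation: the function names (`object_hash_aggregate`, `_sort_based`), the `zero`/`update`/`merge` callbacks, and the default threshold of 128 are all assumptions made for illustration, and the in-memory `sort` stands in for whatever external sort a real engine would use.

```python
from itertools import chain, groupby
from operator import itemgetter


def _sort_based(partial, remaining, zero, update, merge):
    """Hypothetical fallback path: sort everything by key, then aggregate
    one group at a time so only a single buffer object is live at once."""
    # Tag each record so we know whether it is a partial buffer or a raw value.
    tagged = [(key, True, buf) for key, buf in partial.items()]
    tagged += [(key, False, value) for key, value in remaining]
    tagged.sort(key=itemgetter(0))  # stand-in for an external sort
    for key, group in groupby(tagged, key=itemgetter(0)):
        buf = zero()
        for _, is_partial, payload in group:
            buf = merge(buf, payload) if is_partial else update(buf, payload)
        yield key, buf


def object_hash_aggregate(rows, zero, update, merge, fallback_threshold=128):
    """Aggregate (key, value) rows using arbitrary objects as buffers.

    Returns (result_dict, mode), where mode is "hash" or "sort".
    The threshold default is an arbitrary placeholder.
    """
    buffers = {}
    rows = iter(rows)
    for key, value in rows:
        if key not in buffers:
            if len(buffers) >= fallback_threshold:
                # Too many live buffer objects: hand the partial results and
                # the unconsumed input (including this row) to the sort path.
                pending = chain([(key, value)], rows)
                result = dict(_sort_based(buffers, pending, zero, update, merge))
                return result, "sort"
            buffers[key] = zero()
        buffers[key] = update(buffers[key], value)
    return buffers, "hash"


# Usage: collect distinct values per key, with a set as the buffer object.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 2)]
distinct, mode = object_hash_aggregate(
    rows,
    zero=set,
    update=lambda buf, v: buf | {v},
    merge=lambda left, right: left | right,
    fallback_threshold=2,
)
```

With the threshold set to 2, the third distinct key triggers the fallback, and the result matches what the pure hash path would produce. The sketch collects the sorted output into a dictionary only for ease of inspection; a streaming operator would emit each group as soon as it is complete.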

Internally at Databricks we prototyped a version of this for a customer POC and have observed substantial speed-ups over existing Spark.

Attachments

- [Design Doc] Support for Arbitrary Aggregation States.pdf (19/Oct/16 21:47, 84 kB, Cheng Lian)

Issue Links

links to

[Github] Pull Request #15590 (liancheng)

People

Assignee: Cheng Lian
Reporter: Reynold Xin
Votes: 0
Watchers: 8

Dates

Created: 14/Oct/16 23:48
Updated: 28/Jan/18 22:58
Resolved: 03/Nov/16 16:35