[SPARK-40830] Dataset.groupBy.as should be preferred over Dataset.groupByKey - ASF JIRA

Details

Type: Improvement
Status: In Progress
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.4.0
Fix Version/s: None
Component/s: Documentation, SQL
Labels: None

Description

Calling Dataset.groupBy(...).as[K, T] should be preferred over calling Dataset.groupByKey(...) whenever possible. The former allows Catalyst to exploit existing partitioning and ordering of the Dataset, while the latter hides from Catalyst which columns are used to create the keys.

Example:

Calling ds.groupByKey(_.id) hides from Catalyst that column id is the grouping key.
Calling ds.groupBy($"id").as[Int, V] tells Catalyst that ds is to be grouped by (partitioned and ordered by) column "id".
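
The two calls can be put side by side in a minimal sketch. The `Record` case class, the object name, and the method names are assumptions made for illustration and are not taken from the issue or the pull request. Both methods should return the same sums per id; per the description above, only the second exposes the grouping column to Catalyst, so comparing the `explain()` output on a Dataset that is already partitioned by id is one way to see whether an extra exchange is planned.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Record(id: Int, value: Long)

object GroupByAsSketch {
  // Key produced by an opaque function: Catalyst cannot tell it is column `id`.
  def sumByKeyFunction(ds: Dataset[Record]): Dataset[(Int, Long)] = {
    import ds.sparkSession.implicits._
    ds.groupByKey(_.id)
      .mapGroups((id: Int, rows: Iterator[Record]) => (id, rows.map(_.value).sum))
  }

  // Key named as a column: Catalyst knows the grouping is on `id`.
  def sumByKeyColumn(ds: Dataset[Record]): Dataset[(Int, Long)] = {
    import ds.sparkSession.implicits._
    ds.groupBy($"id").as[Int, Record]
      .mapGroups((id: Int, rows: Iterator[Record]) => (id, rows.map(_.value).sum))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupBy-as-sketch")
      .getOrCreate()
    import spark.implicits._

    // Pre-partition by `id` so there is existing partitioning to exploit.
    val ds = Seq(Record(1, 10L), Record(1, 5L), Record(2, 7L)).toDS().repartition($"id")

    // Print both physical plans for comparison.
    sumByKeyFunction(ds).explain()
    sumByKeyColumn(ds).explain()

    spark.stop()
  }
}
```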

This fact should be documented. Further, groupByKey overloads that take Column and String arguments would provide a shortcut for groupBy(...).as and avoid the groupByKey(func) methods.
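
As a rough idea of what such a shortcut might look like, here is a sketch written as an extension method rather than a change to Dataset itself. The name `groupByKeyOn` and the enclosing object are hypothetical and do not exist in Spark; the signatures in the linked pull request may differ. The sketch only delegates to the existing `groupBy(...).as[K, T]`.

```scala
import org.apache.spark.sql.{Column, Dataset, Encoder, KeyValueGroupedDataset}

object GroupByKeyShortcuts {
  // Hypothetical extension: `groupByKeyOn` is NOT part of Spark's Dataset API.
  implicit class ColumnKeyedOps[T: Encoder](ds: Dataset[T]) {

    // Group by a Column, keeping the typed KeyValueGroupedDataset view.
    def groupByKeyOn[K: Encoder](column: Column): KeyValueGroupedDataset[K, T] =
      ds.groupBy(column).as[K, T]

    // Group by a column name, same delegation as above.
    def groupByKeyOn[K: Encoder](columnName: String): KeyValueGroupedDataset[K, T] =
      ds.groupBy(columnName).as[K, T]
  }
}
```

With `import GroupByKeyShortcuts._` and `spark.implicits._` in scope, a call such as `ds.groupByKeyOn[Int]("id")` would then stand in for `ds.groupBy("id").as[Int, V]`.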

Issue Links

links to

[Github] Pull Request #38296 (EnricoMi)

People

Assignee: Unassigned

Reporter: Enrico Minack

Votes: 0

Watchers: 2

Dates

Created: 18/Oct/22 07:43

Updated: 18/Oct/22 08:11