Details
-
Improvement
-
Status: In Progress
-
Minor
-
Resolution: Unresolved
-
3.4.0
-
None
-
None
Description
Calling Dataset.groupBy(...).as[K, T] should be preferred over calling Dataset.groupByKey(...) whenever possible. The former allows Catalyst to exploit existing partitioning and ordering of the Dataset, while the latter hides from Catalyst which columns are used to create the keys.
Example:
Calling ds.groupByKey(_.id) hides from Catalyst that column id is the grouping key.
With ds.groupBy($"id").as[Int, V] tells Catalyst that ds is to be grouped by (partitioned and ordered by) column "id".
This fact should be documented. Further, groupByKey methods with Column and String arguments would help to short cut groupByKey.as and avoid the groupBy(func) methods.