[SPARK-38591] Add sortWithinGroups to KeyValueGroupedDataset - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.4.0
Component/s: SQL
Labels: None

Description

The existing methods KeyValueGroupedDataset.flatMapGroups and KeyValueGroupedDataset.cogroup provide an iterator of rows for each group key. If user code requires those rows in a particular order, it has to materialize and sort the iterator first, which defeats the purpose of an iterator. A great advantage of flatMapGroups and cogroup is that they work with groups that are too large to fit into the memory of one executor. Sorting the rows in the user function breaks this property.
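A minimal sketch of the problematic pattern, assuming a local Spark session and a hypothetical `Event` case class (neither appears in this issue). The user function has to copy each group into a collection before it can sort, so the whole group is held in memory at once:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used for illustration only.
case class Event(user: String, time: Long, action: String)

object SortInsideGroup {
  // Sorting requires all rows of the group: toVector materializes the iterator.
  def sortedByTime(rows: Iterator[Event]): Iterator[Event] =
    rows.toVector.sortBy(_.time).iterator

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sort-inside-group")
      .getOrCreate()
    import spark.implicits._

    val events = Seq(
      Event("a", 3L, "logout"),
      Event("a", 1L, "login"),
      Event("b", 2L, "login")
    ).toDS()

    // The iterator passed to the user function carries no order guarantee,
    // so the function sorts the rows itself, at the cost of buffering the group.
    val ordered = events
      .groupByKey(_.user)
      .flatMapGroups((_, rows) => sortedByTime(rows))

    ordered.show()
    spark.stop()
  }
}
```

This works for small groups, but a single very large group can exhaust executor memory inside `sortedByTime`, even though Spark itself could have spilled that group to disk.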

From the Scaladoc of org.apache.spark.sql.KeyValueGroupedDataset:

Internally, the implementation will spill to disk if any given group is too large to fit into
memory. However, users must take care to avoid materializing the whole iterator for a group
(for example, by calling `toList`) unless they are sure that this is possible given the memory
constraints of their cluster.

The implementations of KeyValueGroupedDataset.flatMapGroups and KeyValueGroupedDataset.cogroup already sort each partition by the group key. By additionally sorting by some data columns, the iterator can be guaranteed to return the rows of each group in a defined order.
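The idea can be illustrated without Spark. The sketch below uses a plain Scala collection as a stand-in for one partition and a hypothetical `Reading` type; it is not Spark's implementation, and the in-memory `sortBy` takes the place of Spark's sort, which can spill to disk. Once the rows are sorted by group key and then by a data column, a single pass over them yields per-group iterators whose rows are already ordered, with no per-group buffering:

```scala
// Hypothetical record type; `sensor` is the group key, `time` the ordering column.
final case class Reading(sensor: String, time: Long, value: Double)

object SortedGroups {
  // Sort by the group key first, then by the column that defines the in-group order.
  def sortPartition(rows: Seq[Reading]): Iterator[Reading] =
    rows.sortBy(r => (r.sensor, r.time)).iterator

  // Hand out one (key, iterator) pair per run of equal keys in the sorted input.
  // Each group iterator must be fully consumed before the next group is requested.
  def groups(sorted: Iterator[Reading]): Iterator[(String, Iterator[Reading])] = {
    val it = sorted.buffered
    new Iterator[(String, Iterator[Reading])] {
      def hasNext: Boolean = it.hasNext
      def next(): (String, Iterator[Reading]) = {
        val key = it.head.sensor
        // The group iterator reads straight from the shared sorted iterator
        // and stops as soon as the key changes.
        val group = new Iterator[Reading] {
          def hasNext: Boolean = it.hasNext && it.head.sensor == key
          def next(): Reading = it.next()
        }
        (key, group)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Reading("b", 5L, 0.5),
      Reading("a", 2L, 1.5),
      Reading("a", 1L, 2.5)
    )
    groups(sortPartition(rows)).foreach { case (key, group) =>
      println(s"$key: ${group.map(_.time).mkString(", ")}")
    }
  }
}
```

Because the order comes from the sort that precedes grouping, the user function only ever sees a streaming iterator and never needs to hold a complete group.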

A new method KeyValueGroupedDataset.sortWithinGroups (similar to Dataset.sortWithinPartitions) would allow users to define the order of rows within each group.