Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
Description
Secondary sort for Spark RDDs was discussed in https://issues.apache.org/jira/browse/SPARK-3655
Since the RDD API allows for easy extensions outside the core library, this was implemented separately here:
https://github.com/tresata/spark-sorted
However, it seems to me that with Dataset, implementing such a feature in a third-party library is not really an option.
Dataset already has methods that suggest a secondary sort is present, such as in KeyValueGroupedDataset:
def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): Dataset[U]
This operation pushes all the data to the reducer, something you would only want to do if you need the elements in a particular order.
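For context, a minimal sketch of how a secondary sort can be approximated with the existing Dataset API: repartition by the key, sort within partitions by key and then by the secondary column, and walk each partition with mapPartitions. The Event case class, its fields, and the runningSums helper are hypothetical names used only for illustration; they are not part of Spark or of this proposal.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object SecondarySortSketch {
  // Hypothetical record type, used only for illustration.
  final case class Event(user: String, ts: Long, value: Int)

  // For each user, emit the running sum of `value` in `ts` order.
  def runningSums(events: Dataset[Event]): Dataset[(String, Long, Int)] = {
    import events.sparkSession.implicits._
    events
      .repartition($"user")                 // all rows of a user land in one partition
      .sortWithinPartitions($"user", $"ts") // rows of a user are contiguous and ordered by ts
      .mapPartitions { rows =>
        // State is local to the partition; it resets whenever the key changes.
        var currentUser: Option[String] = None
        var sum = 0
        rows.map { e =>
          if (!currentUser.contains(e.user)) {
            currentUser = Some(e.user)
            sum = 0
          }
          sum += e.value
          (e.user, e.ts, sum)
        }
      }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("secondary-sort-sketch")
      .getOrCreate()
    import spark.implicits._

    val events = Seq(
      Event("a", 3L, 1), Event("b", 1L, 10), Event("a", 1L, 2), Event("a", 2L, 3)
    ).toDS()

    runningSums(events).collect().foreach(println)
    spark.stop()
  }
}
```

This works, but the grouping logic has to be re-implemented by hand inside mapPartitions, which is the kind of boilerplate a sortBy on the grouped Dataset would remove.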
As an API, how about adding sortBy methods to KeyValueGroupedDataset and RelationalGroupedDataset?
dataFrame.groupBy("a").sortBy("b").fold(...)
(Yes, I know RelationalGroupedDataset doesn't have a fold yet... but it should.)
dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...)
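To pin down the semantics the calls above are meant to have, here is a rough model on plain Scala collections (no Spark involved). The groupSortFlatMap helper is a hypothetical name, and the callback returns an Iterator rather than TraversableOnce to keep the sketch portable across Scala versions; a real implementation would sort during the shuffle instead of in memory.

```scala
object GroupSortModel {
  // Hypothetical local model of groupBy(key).sortBy(order).flatMapGroups(f):
  // each group's values reach `f` already sorted by `order`.
  def groupSortFlatMap[V, K, S: Ordering, U](data: Seq[V])(key: V => K, order: V => S)(
      f: (K, Iterator[V]) => Iterator[U]): Seq[U] =
    data
      .groupBy(key)   // Map[K, Seq[V]]; the order of the groups themselves is unspecified
      .toSeq
      .flatMap { case (k, vs) => f(k, vs.sortBy(order).iterator) }

  def main(args: Array[String]): Unit = {
    val data = Seq(("a", "x", 3), ("b", "y", 1), ("a", "z", 1))
    // Concatenate the second field of each group in order of the third field.
    val result = groupSortFlatMap(data)(_._1, _._3) { (k, vs) =>
      Iterator(k + ":" + vs.map(_._2).mkString(","))
    }
    result.foreach(println)
  }
}
```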
Issue Links
- is related to SPARK-3655 Support sorting of values in addition to keys (i.e. secondary sort) (Resolved)