[SPARK-15798] Secondary sort in Dataset/DataFrame

Secondary sort for Spark RDDs was discussed in https://issues.apache.org/jira/browse/SPARK-3655
Since the RDD API allows for easy extension outside the core library, this was implemented separately here:
https://github.com/tresata/spark-sorted

However, it seems to me that with Dataset, implementing such a feature in a third-party library is not really an option.

Dataset already has methods that suggest a secondary sort is present, such as in KeyValueGroupedDataset:

def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): Dataset[U]

This operation pushes all the data for a key to the reducer, something you would only want to do if you need the elements in a particular order.
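To make the gap concrete, here is a minimal sketch of what you have to do today if you want ordered values inside flatMapGroups: materialize the whole group and sort it in memory. The object name, the sample rows and the choice of the third tuple field as the ordering column are assumptions for illustration only; the Spark calls (groupByKey, flatMapGroups) are the existing Dataset API.

```scala
import org.apache.spark.sql.SparkSession

object InMemorySortPerGroup {
  // Hypothetical sample rows: (key, payload, ordering column).
  val rows = Seq(("a", "x", 3), ("a", "y", 1), ("b", "z", 2), ("a", "w", 2))

  def run(spark: SparkSession): Map[String, String] = {
    import spark.implicits._
    rows.toDS()
      .groupByKey(_._1)
      .flatMapGroups { (key, values) =>
        // The iterator arrives in no guaranteed order, so the whole group
        // is buffered and sorted in memory before it can be consumed.
        val sorted = values.toVector.sortBy(_._3)
        Iterator.single((key, sorted.map(_._2).mkString(",")))
      }
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("in-memory-sort-per-group")
      .getOrCreate()
    try run(spark).toSeq.sortBy(_._1).foreach(println)
    finally spark.stop()
  }
}
```

On the sample rows this yields a -> y,w,x and b -> z. It works, but buffering every group defeats the purpose of getting an Iterator, which is exactly what a secondary sort would avoid.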

How about adding sortBy methods to KeyValueGroupedDataset and RelationalGroupedDataset as the API?

dataFrame.groupBy("a").sortBy("b").fold(...)

(yes, I know RelationalGroupedDataset doesn't have a fold yet... but it should)

dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...)
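For comparison, a rough sketch of how the same effect is commonly approximated with methods that already exist on Dataset (repartition, sortWithinPartitions, mapPartitions). This is not the sortBy API proposed above, just a workaround under stated assumptions: the object name and sample rows are hypothetical, and it relies on mapPartitions seeing the rows of each partition in the order produced by sortWithinPartitions, so that all rows of a key arrive as one consecutive, sorted run.

```scala
import org.apache.spark.sql.SparkSession

object SortWithinPartitionsFold {
  // Hypothetical sample rows: (key, payload, ordering column).
  val rows = Seq(("a", "x", 3), ("a", "y", 1), ("b", "z", 2), ("a", "w", 2))

  def run(spark: SparkSession): Map[String, String] = {
    import spark.implicits._
    rows.toDS()
      .repartition($"_1")                 // co-locate all rows of a key
      .sortWithinPartitions($"_1", $"_3") // key first, then the secondary column
      .mapPartitions { it =>
        val buffered = it.buffered
        // Streams over consecutive runs of equal keys; only one row and
        // the running result are held at a time, not the whole group.
        new Iterator[(String, String)] {
          def hasNext: Boolean = buffered.hasNext
          def next(): (String, String) = {
            val key = buffered.head._1
            val sb = new StringBuilder
            while (buffered.hasNext && buffered.head._1 == key) {
              val row = buffered.next()
              if (sb.length > 0) sb.append(',')
              sb.append(row._2)
            }
            (key, sb.toString)
          }
        }
      }
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sort-within-partitions-fold")
      .getOrCreate()
    try run(spark).toSeq.sortBy(_._1).foreach(println)
    finally spark.stop()
  }
}
```

The run-detection logic is the part every user has to hand-roll today; a sortBy on the grouped Dataset would let flatMapGroups receive the already-sorted iterator directly.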

Related to: SPARK-3655 Support sorting of values in addition to keys (i.e. secondary sort)