Secondary sort for Spark RDDs was discussed in https://issues.apache.org/jira/browse/SPARK-3655
Since the RDD API allows for easy extensions outside the core library, this was implemented separately here:
However, it seems to me that with Dataset, implementing such a feature in a third-party library is not really an option.
Dataset already has methods that suggest a secondary sort is present, such as in KeyValueGroupedDataset:
This operation pushes all the data to the reducer, something you would only want to do if you need the elements in a particular order.
How about adding sortBy methods to KeyValueGroupedDataset and RelationalGroupedDataset as the API?
(Yes, I know RelationalGroupedDataset doesn't have a fold yet... but it should.)
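
To make the request concrete, here is a rough sketch of what I believe the workaround looks like with the existing API: each group's values are buffered and sorted in memory inside mapGroups before folding. The helper foldGroup, the object name and the sample data are made up for illustration, and the sortBy call in the trailing comment is the proposed method, which does not exist in Spark.

```scala
import org.apache.spark.sql.SparkSession

object SecondarySortSketch {
  // Hypothetical helper (not part of Spark): sorts one group's values in memory,
  // then folds them in that order by concatenating them into a string.
  def foldGroup(key: String, rows: Iterator[(String, Int)]): (String, String) = {
    val ordered = rows.map(_._2).toVector.sorted // buffers the whole group in memory
    (key, ordered.foldLeft("")((acc, v) => acc + v))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("secondary-sort-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: (key, value) pairs arriving in no particular order.
    val events = Seq(("a", 3), ("b", 1), ("a", 1), ("a", 2), ("b", 2)).toDS()

    // Workaround with the existing API: group by key, sort inside the reducer.
    val folded = events.groupByKey(_._1).mapGroups(foldGroup _)
    folded.collect().sorted.foreach(println)

    // With the proposed (currently non-existent) API this might instead read:
    //   events.groupByKey(_._1).sortBy(_._2).mapGroups(...)
    // so the values would already arrive sorted and no per-group buffer is needed.

    spark.stop()
  }
}
```

The in-memory sort is only workable while every group fits in memory on one executor; a real secondary sort would let the shuffle do the ordering instead.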