SPARK-15798

Secondary sort in Dataset/DataFrame


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Secondary sort for Spark RDDs was discussed in https://issues.apache.org/jira/browse/SPARK-3655
      Since the RDD API allows for easy extension outside the core library, this was implemented separately here:
      https://github.com/tresata/spark-sorted

      However, it seems to me that with Dataset, implementing such a feature in a third-party library is not really an option.

      Dataset already has methods that suggest a secondary sort is present, such as in KeyValueGroupedDataset:

      def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): Dataset[U]
      

      This operation pushes all the data for a key to the reducer, something you would only want to do if you need the elements in a particular order.
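
      To make the cost concrete, here is a minimal sketch of what ordered processing looks like with the existing API only: each group is materialized and sorted in memory inside flatMapGroups. The Event case class, the sortGroup helper, and the sample data are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.sql.SparkSession

object InMemoryGroupSort {
  // Hypothetical record type, used only for this illustration.
  final case class Event(user: String, ts: Long, action: String)

  // Buffers the whole group and sorts it by timestamp. This is the step a
  // built-in secondary sort would make unnecessary.
  def sortGroup(user: String, events: Iterator[Event]): Vector[(String, Long, String)] =
    events.toVector.sortBy(_.ts).map(e => (user, e.ts, e.action))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("in-memory-group-sort")
      .getOrCreate()
    import spark.implicits._

    val events = Seq(
      Event("a", 3L, "logout"),
      Event("b", 1L, "login"),
      Event("a", 1L, "login"),
      Event("a", 2L, "click")
    ).toDS()

    // The order of the iterator handed to flatMapGroups is not specified,
    // so every group has to fit in memory before it can be sorted.
    val ordered = events
      .groupByKey(_.user)
      .flatMapGroups((user, it) => sortGroup(user, it))

    ordered.collect().foreach(println)
    spark.stop()
  }
}
```

      This works for small groups, but a single large key can exhaust executor memory, which is the motivation for sorting during the shuffle instead.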

      As an API, how about sortBy methods on KeyValueGroupedDataset and RelationalGroupedDataset?

      dataFrame.groupBy("a").sortBy("b").fold(...)
      

      (Yes, I know RelationalGroupedDataset doesn't have a fold yet... but it should.)

      dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...)
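
      Until something like the proposed sortBy exists, a similar effect can probably be approximated with existing Dataset methods: repartition by the key, sortWithinPartitions by key and secondary column, then walk each partition with mapPartitions. The sketch below assumes that approach. The Event case class and the foldRuns helper are hypothetical names, not part of Spark.

```scala
import org.apache.spark.sql.SparkSession

object SortedGroupsSketch {
  // Hypothetical record type, used only for this illustration.
  final case class Event(user: String, ts: Long, action: String)

  // Folds each run of consecutive equal keys in a single pass, so no group is
  // buffered. Assumes the input is already clustered by key.
  def foldRuns[K, V, A](rows: Iterator[(K, V)], zero: A)(op: (A, V) => A): Iterator[(K, A)] =
    new Iterator[(K, A)] {
      private val in = rows.buffered
      def hasNext: Boolean = in.hasNext
      def next(): (K, A) = {
        val key = in.head._1
        var acc = zero
        while (in.hasNext && in.head._1 == key) acc = op(acc, in.next()._2)
        (key, acc)
      }
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sorted-groups-sketch")
      .getOrCreate()
    import spark.implicits._

    val events = Seq(
      Event("a", 3L, "logout"),
      Event("b", 1L, "login"),
      Event("a", 1L, "login"),
      Event("a", 2L, "click")
    ).toDS()

    val paths = events
      .repartition($"user")                 // all rows of a user land in one partition
      .sortWithinPartitions($"user", $"ts") // cluster by user, order by ts within a user
      .mapPartitions { it =>
        // Concatenate each user's actions in timestamp order.
        foldRuns(it.map(e => (e.user, e.action)), "") { (acc, action) =>
          if (acc.isEmpty) action else acc + ">" + action
        }
      }

    paths.collect().foreach(println)
    spark.stop()
  }
}
```

      The drawback is that the grouping logic has to be hand-written per job, which is what a sortBy on the grouped types would hide.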
      

            People

              Assignee: Unassigned
              Reporter: koert kuipers
              Votes: 4
              Watchers: 14
