SPARK-38591

Add sortWithinGroups to KeyValueGroupedDataset

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: SQL
    • Labels: None

    Description

      The existing methods KeyValueGroupedDataset.flatMapGroups and KeyValueGroupedDataset.cogroup provide an iterator of rows for each group key. If user code requires those rows in a particular order, it must first materialize and sort the iterator's contents, which defeats the purpose of an iterator. A major advantage of flatMapGroups and cogroup is that they work with groups that are too large to fit into the memory of a single executor. Sorting inside the user function forfeits this property.

      The documentation of org.apache.spark.sql.KeyValueGroupedDataset states:

      Internally, the implementation will spill to disk if any given group is too large to fit into
      memory. However, users must take care to avoid materializing the whole iterator for a group
      (for example, by calling `toList`) unless they are sure that this is possible given the memory
      constraints of their cluster.
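
The pattern this warning is about can be sketched as follows. This is a minimal illustration, not code from the issue: the `Event` case class, the sample data, and the `earliestPerUser` helper are hypothetical names chosen for the example. The user function sorts the group itself, so it has to pull every row of the group into memory first.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Event(user: String, time: Long, value: Double)

object SortInUserFunction {
  // Returns the earliest event per user by sorting inside the user function.
  def earliestPerUser(events: Dataset[Event]): Dataset[Event] = {
    import events.sparkSession.implicits._
    events
      .groupByKey(_.user)
      .flatMapGroups { (_, rows) =>
        // toVector materializes the whole group on the executor before sorting.
        // For a very large group this can exhaust memory, even though the
        // iterator handed to this function could have been consumed lazily.
        rows.toVector.sortBy(_.time).take(1)
      }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sort-in-user-function")
      .getOrCreate()
    import spark.implicits._

    val events = Seq(
      Event("a", 3L, 1.0), Event("a", 1L, 2.0),
      Event("b", 2L, 3.0), Event("a", 2L, 4.0)
    ).toDS()

    earliestPerUser(events).show()
    spark.stop()
  }
}
```

The result is correct for small groups, but the memory cost grows with the size of the largest group rather than staying constant.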
      

      The implementations of KeyValueGroupedDataset.flatMapGroups and KeyValueGroupedDataset.cogroup already sort each partition by the group key. By additionally sorting by some data columns, the iterator can be guaranteed to yield the rows of each group in a defined order.
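
The idea of sorting by the group key plus data columns can be approximated with public Dataset methods. The sketch below is an illustration under assumptions, not Spark's internal implementation: `Event`, the column names `user` and `time`, and `earliestPerUser` are hypothetical. It co-locates all rows of a key in one partition, sorts each partition by key and then by time, and walks the sorted iterator without materializing any group.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type, used only for this illustration.
case class Event(user: String, time: Long, value: Double)

object SecondarySortSketch {
  // Returns the earliest event per user without collecting a group in memory.
  def earliestPerUser(events: Dataset[Event]): Dataset[Event] = {
    import events.sparkSession.implicits._
    events
      // All rows of one user end up in the same partition.
      .repartition(col("user"))
      // Sort by the group key first, then by the data column.
      .sortWithinPartitions(col("user"), col("time"))
      .mapPartitions { rows =>
        // Rows arrive grouped by user and ordered by time, so the first row
        // seen for each user is its earliest event.
        var previousUser: Option[String] = None
        rows.filter { e =>
          val isFirstOfGroup = !previousUser.contains(e.user)
          previousUser = Some(e.user)
          isFirstOfGroup
        }
      }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("secondary-sort-sketch")
      .getOrCreate()
    import spark.implicits._

    val events = Seq(
      Event("a", 3L, 1.0), Event("a", 1L, 2.0),
      Event("b", 2L, 3.0), Event("a", 2L, 4.0)
    ).toDS()

    earliestPerUser(events).show()
    spark.stop()
  }
}
```

This works, but the user has to detect group boundaries by hand, which is the bookkeeping that flatMapGroups normally hides.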

      A new method KeyValueGroupedDataset.sortWithinGroups (similar to Dataset.sortWithinPartitions) would allow users to define the order within each group.
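
A sketch of what a sorted-group call can look like. The issue text does not give a signature for the proposed method. To my knowledge, the change shipped in Spark 3.4.0 as `flatMapSortedGroups` and `cogroupSorted` on KeyValueGroupedDataset rather than under the name sortWithinGroups, so treat the method name and signature below as an assumption to check against the API documentation of your Spark version. `Event` and `earliestPerUser` are hypothetical names for the example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type, used only for this illustration.
case class Event(user: String, time: Long, value: Double)

object SortedGroupsSketch {
  // Returns the earliest event per user, letting Spark order each group.
  // Assumes Spark 3.4.0 or later, where flatMapSortedGroups is expected to exist.
  def earliestPerUser(events: Dataset[Event]): Dataset[Event] = {
    import events.sparkSession.implicits._
    events
      .groupByKey(_.user)
      // The iterator passed to the function is ordered by the given columns.
      .flatMapSortedGroups(col("time")) { (_, rows) =>
        // Only the first row is consumed; the group is never materialized
        // by the user function.
        rows.take(1)
      }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sorted-groups-sketch")
      .getOrCreate()
    import spark.implicits._

    val events = Seq(
      Event("a", 3L, 1.0), Event("a", 1L, 2.0),
      Event("b", 2L, 3.0), Event("a", 2L, 4.0)
    ).toDS()

    earliestPerUser(events).show()
    spark.stop()
  }
}
```

Because the ordering is declared with column expressions, Spark can fold it into the per-partition sort it already performs on the group key.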

          People

            Assignee: Apache Spark (apachespark)
            Reporter: Enrico Minack (enricomi)