  1. Spark
  2. SPARK-40830

Dataset.groupBy.as should be preferred over Dataset.groupByKey

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.4.0
    • Fix Version/s: None
    • Component/s: Documentation, SQL
    • Labels: None

    Description

      Calling Dataset.groupBy(...).as[K, T] should be preferred over calling Dataset.groupByKey(...) whenever possible. The former allows Catalyst to exploit existing partitioning and ordering of the Dataset, while the latter hides from Catalyst which columns are used to create the keys.

      Example:

      Calling ds.groupByKey(_.id) hides from Catalyst that column id is the grouping key.
      Calling ds.groupBy($"id").as[Int, V] instead tells Catalyst that ds is to be grouped by (partitioned and ordered by) column "id".
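
      The difference can be sketched with a small local job. This is a minimal illustration, not taken from the issue: the `Rec` case class, the sample rows, and the `GroupByAsSketch` object are hypothetical, and a local Spark 3.x session is assumed. The Dataset is first repartitioned by `id`, so one would expect the plan printed for the `groupBy(...).as` variant to reuse that partitioning, while the `groupByKey` variant groups on a key computed by an opaque function and is likely to show an additional exchange. The exact plans depend on the Spark version and configuration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Rec(id: Int, value: String)

object GroupByAsSketch {
  // Counts rows per id in both ways and returns the two results for comparison.
  def run(spark: SparkSession): (Map[Int, Int], Map[Int, Int]) = {
    import spark.implicits._

    // Pre-partition by the column that is later used as the grouping key.
    val ds: Dataset[Rec] =
      Seq(Rec(1, "a"), Rec(1, "b"), Rec(2, "c")).toDS().repartition($"id")

    // Key computed by a Scala function: Catalyst cannot see that it is column "id".
    val viaKey = ds.groupByKey(_.id).mapGroups((id, rows) => (id, rows.size))

    // Key declared as a column: Catalyst knows the grouping is on "id".
    val viaCol = ds.groupBy($"id").as[Int, Rec].mapGroups((id, rows) => (id, rows.size))

    // Print both physical plans so the exchange operators can be compared by eye.
    viaKey.explain()
    viaCol.explain()

    (viaKey.collect().toMap, viaCol.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupBy-as-sketch")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

      Both variants return the same per-key counts; only the information available to the optimizer differs.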

      This fact should be documented. Further, groupByKey overloads that take Column or String arguments would provide a shortcut for groupBy(...).as and help users avoid the groupByKey(func) methods.
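
      Until such methods exist, a similar shortcut could be approximated in user code. The following is only a sketch under assumptions: `groupByKeyCols`, `ColumnKeyedOps`, and `GroupByKeyColumnSketch` are hypothetical names that are not part of Spark, and the method simply delegates to the existing `groupBy(...).as[K, T]`.

```scala
import org.apache.spark.sql.{Column, Dataset, Encoder, KeyValueGroupedDataset}

object GroupByKeyColumnSketch {
  // Hypothetical extension; not an existing Spark API.
  implicit class ColumnKeyedOps[T](ds: Dataset[T]) {
    // Groups by the given columns and exposes the typed key/value API,
    // keeping the grouping columns visible to Catalyst.
    def groupByKeyCols[K](col: Column, cols: Column*)(
        implicit kEnc: Encoder[K], tEnc: Encoder[T]): KeyValueGroupedDataset[K, T] =
      ds.groupBy(col +: cols: _*).as[K, T]
  }
}
```

      With `spark.implicits._` and `GroupByKeyColumnSketch._` in scope, a call such as `ds.groupByKeyCols[Int]($"id")` would then stand in for `ds.groupBy($"id").as[Int, V]`.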

          People

            Assignee: Unassigned
            Reporter: Enrico Minack (enricomi)