Spark / SPARK-23705

dataframe.groupBy() may inadvertently receive sequence of non-distinct strings



    Description

      // Excerpt from org.apache.spark.sql.Dataset
      package org.apache.spark.sql
      // ...
      class Dataset[T] private[sql](
        // ...

        def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
          val colNames: Seq[String] = col1 +: cols
          RelationalGroupedDataset(
            toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
        }
      // ...
      

      The suggestion is to append a `.distinct` to `colNames` in `groupBy`, so that duplicate column names passed by the caller are resolved and grouped on only once.
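
      A minimal sketch of what that change might look like, assuming `.distinct` is applied directly to the assembled `colNames` (the exact placement would be up to reviewers):

      def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
        // Deduplicate the caller-supplied column names before resolving them.
        val colNames: Seq[String] = (col1 +: cols).distinct
        RelationalGroupedDataset(
          toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
      }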

       

      It is not clear whether the community agrees with this, or whether it should be left to users to perform the distinct operation themselves.
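
      For context, a hypothetical sketch of how duplicate names can reach `groupBy` without the caller noticing, e.g. when the grouping keys are assembled programmatically (the DataFrame and key lists below are made up for illustration):

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("groupBy-duplicates").getOrCreate()
      import spark.implicits._

      val df = Seq((1, "a"), (1, "b"), (2, "c")).toDF("id", "label")

      // A default key list concatenated with user-supplied keys silently
      // introduces a duplicate: Seq("id", "id", "label").
      val defaultKeys = Seq("id")
      val userKeys    = Seq("id", "label")
      val keys        = defaultKeys ++ userKeys

      // groupBy receives "id" twice; with the proposed .distinct it would be
      // resolved and grouped on only once.
      df.groupBy(keys.head, keys.tail: _*).count().show()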

People

  Assignee: Unassigned
  Reporter: Khoa Tran (khoatrantan2000)
  Votes: 0
  Watchers: 4


Time Tracking

  Original Estimate: 1h
  Remaining Estimate: 1h
  Time Spent: Not Specified