Spark / SPARK-23705

dataframe.groupBy() may inadvertently receive sequence of non-distinct strings



    Description

      // Excerpt from org.apache.spark.sql.Dataset
      package org.apache.spark.sql
      // ...
      class Dataset[T] private[sql](
        // ...

        def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
          val colNames: Seq[String] = col1 +: cols
          RelationalGroupedDataset(
            toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
        }
      // ...
      

      The suggestion is to append a `.distinct` to `colNames` in `groupBy`, so that duplicate column names passed by the caller are resolved and grouped on only once.
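
      A minimal sketch of what that change might look like, assuming `.distinct` is applied directly to the assembled `colNames` (the exact placement would be up to reviewers):

      def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
        // Deduplicate the caller-supplied column names before resolving them.
        val colNames: Seq[String] = (col1 +: cols).distinct
        RelationalGroupedDataset(
          toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
      }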

       

      It is not clear whether the community agrees with this, or whether it should be left to users to perform the distinct operation themselves.
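
      For context, a hypothetical sketch of how duplicate names can reach `groupBy` without the caller noticing, e.g. when the grouping keys are assembled programmatically (the DataFrame and key lists below are made up for illustration):

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("groupBy-duplicates").getOrCreate()
      import spark.implicits._

      val df = Seq((1, "a"), (1, "b"), (2, "c")).toDF("id", "label")

      // A default key list concatenated with user-supplied keys silently
      // introduces a duplicate: Seq("id", "id", "label").
      val defaultKeys = Seq("id")
      val userKeys    = Seq("id", "label")
      val keys        = defaultKeys ++ userKeys

      // groupBy receives "id" twice; with the proposed .distinct it would be
      // resolved and grouped on only once.
      df.groupBy(keys.head, keys.tail: _*).count().show()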

People

  Assignee: Unassigned
  Reporter: Khoa Tran (khoatrantan2000)
  Votes: 0
  Watchers: 4


Time Tracking

  Original Estimate: 1h
  Remaining Estimate: 1h
  Time Spent: Not Specified