Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
1.4.0, 1.5.0
Description
To reproduce, call distinct() on a DataFrame in spark-shell. For example:
scala> sqlContext.table("my_table").distinct()
<console>:19: error: not enough arguments for method apply: (colName: String)org.apache.spark.sql.Column in class DataFrame.
Unspecified value parameter colName.
This is confusing because distinct in DataFrame is an alias of dropDuplicates, and both dropDuplicates and dropDuplicates() work.
Here is a summary:
Scala code | Works |
---|---|
DF.distinct | Y |
DF.distinct() | N |
DF.dropDuplicates | Y |
DF.dropDuplicates() | Y |
Looking at the definition of distinct, it is declared without ():
override def distinct: DataFrame = dropDuplicates()
As a result, what seems to be happening is as follows:
distinct() => dropDuplicates()()
           => DataFrame()        // because dropDuplicates() returns DF
           => DataFrame.apply()  // fails because apply() takes a column parameter
I can verify that adding () to the definition makes both distinct and distinct() work.
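
The behaviour can be reproduced without Spark. The following is a minimal sketch, not Spark's actual code: Frame, distinctNoParens and distinctWithParens are hypothetical names made up for illustration, and Frame.apply only mimics the shape of DataFrame.apply(colName). It shows that parentheses written after a method declared without a parameter list are applied to the method's result, which is why distinct() ends up calling apply.

```scala
// Hypothetical stand-in for DataFrame; none of these names come from Spark.
final case class Frame(rows: Seq[String]) {
  // Mimics DataFrame.apply(colName): requires exactly one argument.
  def apply(colName: String): String = s"column($colName)"

  // Declared with an empty parameter list, like dropDuplicates().
  def dropDuplicates(): Frame = Frame(rows.distinct)

  // Declared WITHOUT a parameter list, like the reported definition of distinct.
  def distinctNoParens: Frame = dropDuplicates()

  // Declared WITH an empty parameter list, like the proposed fix.
  def distinctWithParens(): Frame = dropDuplicates()
}

object DistinctDemo {
  def main(args: Array[String]): Unit = {
    val df = Frame(Seq("a", "b", "a"))

    // Both declarations return the same de-duplicated Frame.
    println(df.distinctNoParens == df.distinctWithParens())

    // Arguments after a parameterless method go to apply on its result:
    // this is df.distinctNoParens.apply("x").
    println(df.distinctNoParens("x"))

    // Does not compile: it is treated as df.distinctNoParens.apply(),
    // and apply is missing its colName argument.
    // df.distinctNoParens()
  }
}
```

With the Scala 2.10/2.11 compilers used by Spark at the time, a method declared with () can also be called without parentheses, which matches the dropDuplicates rows in the table above. Newer Scala versions warn about or reject that shorthand, so the sketch always calls distinctWithParens() with explicit parentheses.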