[SPARK-8908] Calling distinct() with parentheses throws error in Scala DataFrame - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.4.0, 1.5.0
Fix Version/s: 1.5.0
Component/s: SQL
Labels:
- releasenotes

Description

To reproduce, call distinct() on a DataFrame in spark-shell. For example:

scala> sqlContext.table("my_table").distinct()

<console>:19: error: not enough arguments for method apply: (colName: String)org.apache.spark.sql.Column in class DataFrame.
Unspecified value parameter colName.

This is confusing because distinct in DataFrame is an alias of dropDuplicates, and both dropDuplicates and dropDuplicates() work.

Here is a summary:

Scala code	Works
DF.distinct	Y
DF.distinct()	N
DF.dropDuplicates	Y
DF.dropDuplicates()	Y

Looking at the definition of distinct, it is declared without a parameter list ():

override def distinct: DataFrame = dropDuplicates()

As a result, what seems to be happening is the following:

distinct()
=> dropDuplicates()()
=> DataFrame() // because dropDuplicates() returns DF
=> DataFrame.apply() // fails because apply() takes a column parameter

I can verify that adding () to the definition makes both distinct and distinct() work.
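
The expansion above can be reproduced outside Spark with a minimal sketch. The Frame and Column classes and the distinctNoParens / distinctWithParens names below are hypothetical stand-ins, not Spark's API. They only mirror the shapes quoted in this report: an apply(colName: String) method, a dropDuplicates() method, and a distinct declared with and without ().

```scala
// Hypothetical stand-ins for illustration; these are NOT Spark's classes.
final case class Column(name: String)

final class Frame(val rows: Seq[String]) {
  // Mirrors the apply(colName: String) signature quoted in the error message.
  def apply(colName: String): Column = Column(colName)

  def dropDuplicates(): Frame = new Frame(rows.distinct)

  // Declared without a parameter list, like the original `distinct`.
  def distinctNoParens: Frame = dropDuplicates()

  // Declared with an empty parameter list, like the proposed fix.
  def distinctWithParens(): Frame = dropDuplicates()
}

object DistinctParensDemo {
  def main(args: Array[String]): Unit = {
    val df = new Frame(Seq("a", "a", "b"))

    // Both declarations deduplicate when called in their declared form.
    println(df.distinctNoParens.rows)
    println(df.distinctWithParens().rows)

    // Arguments written after a parameterless method go to the result's
    // apply: this is df.distinctNoParens.apply("id") and yields a Column.
    println(df.distinctNoParens("id"))

    // Does not compile, for the reason described in this report: the empty
    // argument list is passed to apply, which requires colName.
    // df.distinctNoParens()
  }
}
```

Uncommenting the last call should produce a "not enough arguments for method apply" compile error similar to the one above. Scala 2 also allows a method declared with () to be called without parentheses, which is consistent with both distinct and distinct() working after the change.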

Issue Links

links to

[Github] Pull Request #7298 (piaozhexiu)

People

Assignee: Cheolsoo Park

Reporter: Cheolsoo Park

Votes: 0

Watchers: 2

Dates

Created: 08/Jul/15 20:22

Updated: 08/Jul/15 22:21

Resolved: 08/Jul/15 22:18