SPARK-8908

Calling distinct() with parentheses throws error in Scala DataFrame

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.5.0
    • Fix Version/s: 1.5.0
    • Component/s: SQL

    Description

      To reproduce, call distinct() on a DataFrame in spark-shell. For example:

      scala> sqlContext.table("my_table").distinct()
      
      <console>:19: error: not enough arguments for method apply: (colName: String)org.apache.spark.sql.Column in class DataFrame.
      Unspecified value parameter colName.
      

      This is confusing because distinct in DataFrame is an alias of dropDuplicates, and both dropDuplicates and dropDuplicates() work.

      Here is a summary:

      Scala code             Works
      DF.distinct            Y
      DF.distinct()          N
      DF.dropDuplicates      Y
      DF.dropDuplicates()    Y

      Looking at the definition of distinct, it is declared without the empty parameter list ():

      override def distinct: DataFrame = dropDuplicates()
      

      As a result, this is what seems to be happening:

      distinct()
      => dropDuplicates()()
      => DataFrame() // because dropDuplicates() returns DF
      => DataFrame.apply() // fails because apply() takes a column parameter
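
The expansion above can be reproduced without Spark. The following is a minimal sketch, assuming a hypothetical `Frame` class that stands in for DataFrame (it is not Spark's actual implementation); only the shape of the three method declarations matters.

```scala
// Hypothetical stand-in for DataFrame; not Spark's actual class.
final case class Frame(rows: Seq[String]) {
  // Mirrors the shape of DataFrame.apply(colName: String).
  def apply(colName: String): String = s"column($colName)"

  // Declared with an empty parameter list.
  def dropDuplicates(): Frame = Frame(rows.distinct)

  // Declared WITHOUT a parameter list, as in the reported definition.
  def distinct: Frame = dropDuplicates()
}

object Repro {
  val df: Frame = Frame(Seq("a", "b", "a"))

  // Compiles: a parameterless method is called without parentheses.
  val ok: Frame = df.distinct

  // Does not compile: the () is applied to the result of df.distinct,
  // so it is treated as df.distinct.apply() and colName is missing.
  // val bad = df.distinct()

  // Compiles, which shows where the arguments go: df.distinct.apply("name").
  val col: String = df.distinct("name")
}
```

Because `distinct` has no parameter list of its own, any argument list written after it is handed to `apply` on the returned value, which is why the compiler error mentions `colName`.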
      

      I can verify that adding () to the definition makes both distinct and distinct() work.
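
A minimal sketch of that change, again using a hypothetical `FixedFrame` class rather than Spark's DataFrame. Once `distinct` declares an empty parameter list, the trailing () binds to `distinct` itself instead of to `apply`.

```scala
// Hypothetical stand-in for DataFrame with the proposed fix applied.
final case class FixedFrame(rows: Seq[String]) {
  // Mirrors the shape of DataFrame.apply(colName: String).
  def apply(colName: String): String = s"column($colName)"

  def dropDuplicates(): FixedFrame = FixedFrame(rows.distinct)

  // The fix: declare distinct with an empty parameter list.
  def distinct(): FixedFrame = dropDuplicates()
}

object FixedRepro {
  val df: FixedFrame = FixedFrame(Seq("a", "b", "a"))

  // Now compiles: () is the argument list of distinct itself.
  val deduped: FixedFrame = df.distinct()

  // In Scala 2, which Spark 1.x is built against, df.distinct without
  // parentheses also compiles, because the compiler inserts the empty
  // argument list. Scala 3 rejects that form for methods defined in Scala,
  // so it is left commented out here.
  // val alsoDeduped: FixedFrame = df.distinct
}
```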

          People

            Assignee: Cheolsoo Park (cheolsoo)
            Reporter: Cheolsoo Park (cheolsoo)