SPARK-41391

The output column name of `groupBy.agg(count_distinct)` is incorrect


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0, 3.3.0, 3.4.0
    • Fix Version/s: 3.5.0
    • Component/s: SQL
    • Labels: None

    Description

      scala> val df = spark.range(1, 10).withColumn("value", lit(1))
      df: org.apache.spark.sql.DataFrame = [id: bigint, value: int]

      scala> df.createOrReplaceTempView("table")

      scala> df.groupBy("id").agg(count_distinct($"value"))
      res1: org.apache.spark.sql.DataFrame = [id: bigint, count(value): bigint]

      scala> spark.sql(" SELECT id, COUNT(DISTINCT value) FROM table GROUP BY id ")
      res2: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT value): bigint]

      scala> df.groupBy("id").agg(count_distinct($"*"))
      res3: org.apache.spark.sql.DataFrame = [id: bigint, count(unresolvedstar()): bigint]

      scala> spark.sql(" SELECT id, COUNT(DISTINCT *) FROM table GROUP BY id ")
      res4: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT id, value): bigint]
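
On affected versions, one possible workaround is to name the aggregate column explicitly instead of relying on the generated name. The following is a minimal sketch, not part of the original report: the object `CountDistinctAlias`, the helper `distinctCounts`, and the local `SparkSession` settings are hypothetical names chosen for illustration, and the alias string simply copies the column name that the SQL query in the transcript produces.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count_distinct, lit}

// Hypothetical wrapper object, introduced only for this sketch.
object CountDistinctAlias {

  // Hypothetical helper: alias the aggregate explicitly so the output
  // column name does not depend on the auto-generated one.
  def distinctCounts(df: DataFrame): DataFrame =
    df.groupBy("id")
      .agg(count_distinct(col("value")).as("count(DISTINCT value)"))

  def main(args: Array[String]): Unit = {
    // Local session settings are assumptions for a standalone run.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("spark-41391-sketch")
      .getOrCreate()
    try {
      // Same input as in the report: ids 1..9 with a constant value column.
      val df = spark.range(1, 10).withColumn("value", lit(1))
      distinctCounts(df).printSchema()
    } finally {
      spark.stop()
    }
  }
}
```

The alias is set by the caller, so the resulting schema is the same whether or not the fix is present.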

          People

            Assignee: Ritika Maheshwari (ritika)
            Reporter: Ruifeng Zheng (podongfeng)