Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-27581

DataFrame countDistinct("*") fails with AnalysisException: "Invalid usage of '*' in expression 'count'"

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.4.0
    • 3.0.0
    • SQL
    • None

    Description

      If I have a DataFrame then I can use count("*") as an expression, e.g.:

      import org.apache.spark.sql.functions._
      val df = sql("select id % 100 from range(100000)")
      df.select(count("*")).first()
      

      However, if I try to do the same thing with countDistinct I get an error:

      import org.apache.spark.sql.functions._
      val df = sql("select id % 100 from range(100000)")
      df.select(countDistinct("*")).first()
      
      org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'count';
      

      As a workaround, I need to use expr, e.g.

      import org.apache.spark.sql.functions._
      val df = sql("select id % 100 from range(100000)")
      df.select(expr("count(distinct(*))")).first()
      

      You might be wondering "why not just use df.count() or df.distinct().count()?" but in my case I'd ultimately to compute both counts as part of the same aggregation, e.g.

      val (cnt, distinctCnt) = df.select(count("*"), countDistinct("*)).as[(Long, Long)].first()
      

      I'm reporting this because it's a minor usability annoyance / surprise for inexperienced Spark users.

      Attachments

        Issue Links

          Activity

            People

              viirya L. C. Hsieh
              joshrosen Josh Rosen
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: