Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-27581

DataFrame countDistinct("*") fails with AnalysisException: "Invalid usage of '*' in expression 'count'"

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels:
      None

      Description

      If I have a DataFrame then I can use count("*") as an expression, e.g.:

      import org.apache.spark.sql.functions._
      val df = sql("select id % 100 from range(100000)")
      df.select(count("*")).first()
      

      However, if I try to do the same thing with countDistinct I get an error:

      import org.apache.spark.sql.functions._
      val df = sql("select id % 100 from range(100000)")
      df.select(countDistinct("*")).first()
      
      org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'count';
      

      As a workaround, I need to use expr, e.g.

      import org.apache.spark.sql.functions._
      val df = sql("select id % 100 from range(100000)")
      df.select(expr("count(distinct(*))")).first()
      

      You might be wondering "why not just use df.count() or df.distinct().count()?" but in my case I'd ultimately to compute both counts as part of the same aggregation, e.g.

      val (cnt, distinctCnt) = df.select(count("*"), countDistinct("*)).as[(Long, Long)].first()
      

      I'm reporting this because it's a minor usability annoyance / surprise for inexperienced Spark users.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                viirya L. C. Hsieh
                Reporter:
                joshrosen Josh Rosen
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: