[SPARK-27581] DataFrame countDistinct("*") fails with AnalysisException: "Invalid usage of '*' in expression 'count'" - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 3.0.0
Component/s: SQL
Labels:
None

Description

If I have a DataFrame then I can use count("*") as an expression, e.g.:

import org.apache.spark.sql.functions._
val df = sql("select id % 100 from range(100000)")
df.select(count("*")).first()

However, if I try to do the same thing with countDistinct I get an error:

import org.apache.spark.sql.functions._
val df = sql("select id % 100 from range(100000)")
df.select(countDistinct("*")).first()

org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'count';

As a workaround, I need to use expr, e.g.

import org.apache.spark.sql.functions._
val df = sql("select id % 100 from range(100000)")
df.select(expr("count(distinct(*))")).first()

You might be wondering "why not just use df.count() or df.distinct().count()?" but in my case I'd ultimately to compute both counts as part of the same aggregation, e.g.

val (cnt, distinctCnt) = df.select(count("*"), countDistinct("*)).as[(Long, Long)].first()

I'm reporting this because it's a minor usability annoyance / surprise for inexperienced Spark users.

Attachments

Issue Links

links to

GitHub Pull Request #24482

Activity

People

Assignee:: L. C. Hsieh

Reporter:: Josh Rosen

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 27/Apr/19 02:06

Updated:: 29/Apr/19 13:19

Resolved:: 29/Apr/19 13:18