Description
If I have a DataFrame then I can use count("*") as an expression, e.g.:
import org.apache.spark.sql.functions._
val df = sql("select id % 100 from range(100000)")
df.select(count("*")).first()
However, if I try to do the same thing with countDistinct I get an error:
import org.apache.spark.sql.functions._
val df = sql("select id % 100 from range(100000)")
df.select(countDistinct("*")).first()

org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'count';
As a workaround, I need to use expr, e.g.
import org.apache.spark.sql.functions._
val df = sql("select id % 100 from range(100000)")
df.select(expr("count(distinct(*))")).first()
You might be wondering "why not just use df.count() or df.distinct().count()?", but in my case I'd ultimately like to compute both counts as part of the same aggregation, e.g.
val (cnt, distinctCnt) = df.select(count("*"), countDistinct("*")).as[(Long, Long)].first()
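For completeness, the expr workaround also covers the combined case. A minimal sketch, assuming a spark-shell session where sql and the implicit Encoders (via spark.implicits._) are in scope:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._ // Encoder for (Long, Long)

val df = sql("select id % 100 from range(100000)")

// Workaround: spell out the distinct count via expr instead of countDistinct("*"),
// so both counts are computed in a single aggregation pass.
val (cnt, distinctCnt) = df
  .select(count("*"), expr("count(distinct(*))"))
  .as[(Long, Long)]
  .first()
// cnt should be 100000 and distinctCnt should be 100 for this input
```

This avoids scanning the data twice, which is the point of wanting both aggregates in one select rather than calling df.count() and df.distinct().count() separately.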
I'm reporting this because it's a minor usability annoyance / surprise for inexperienced Spark users.