Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Won't Do
- Affects Version: 3.0.0
Description
Spark currently supports filtered aggregation but does not expose an API for it in the `spark.sql.functions` package.
It is possible to use filtered aggregation when writing SQL directly:
```scala
scala> val df = spark.range(100)
scala> df.registerTempTable("df")
scala> spark.sql("select count(1) as classic_cnt, count(1) FILTER (WHERE id < 50) from df").show()
+-----------+-------------------------------------------------+
|classic_cnt|count(1) FILTER (WHERE (id < CAST(50 AS BIGINT)))|
+-----------+-------------------------------------------------+
|        100|                                               50|
+-----------+-------------------------------------------------+
```
These aggregations are especially useful when aggregating over overlapping subsets of the data (where a pivot would not work):
```sql
SELECT
  AVG(revenue) FILTER (WHERE age < 25),
  AVG(revenue) FILTER (WHERE age < 35),
  AVG(revenue) FILTER (WHERE age < 45)
FROM people;
```
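Until such an API exists, one workaround with the current `spark.sql.functions` package is conditional aggregation: feed `avg` a `when` expression that evaluates to null (and is therefore ignored by the aggregate) whenever the predicate fails. The sketch below assumes a small in-memory `people` table with the `age` and `revenue` columns from the query above; the `SparkSession` setup is illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, when}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("filtered-agg-workaround")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the "people" table.
val people = Seq((20, 100.0), (30, 200.0), (40, 300.0)).toDF("age", "revenue")

// avg() skips nulls, so when(pred, value) with no otherwise() acts
// like FILTER (WHERE pred) for null-ignoring aggregates.
val row = people.agg(
  avg(when(col("age") < 25, col("revenue"))).as("under25"),
  avg(when(col("age") < 35, col("revenue"))).as("under35"),
  avg(when(col("age") < 45, col("revenue"))).as("under45")
).head()

spark.stop()
```

This trick works for aggregates that ignore nulls (`avg`, `sum`, `min`, `max`); for `count(1)`-style aggregates one would count the `when` expression itself, e.g. `count(when(col("id") < 50, true))`.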
I did not find an existing issue tracking this, so I am creating this one and will attach a PR illustrating a possible implementation.