Spark / SPARK-33196

Expose filtered aggregation API


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Do
    • Affects Version: 3.0.0
    • Fix Version: None
    • Component: SQL
    • Labels: None

    Description

      Spark currently supports filtered aggregation but does not expose an API for using it through the `spark.sql.functions` package.

      It is possible to use it when writing SQL directly:

      scala> val df = spark.range(100)
      scala> df.createOrReplaceTempView("df")
      scala> spark.sql("select count(1) as classic_cnt, count(1) FILTER (WHERE id < 50) from df").show()
      +-----------+-------------------------------------------------+ 
      |classic_cnt|count(1) FILTER (WHERE (id < CAST(50 AS BIGINT)))|
      +-----------+-------------------------------------------------+
      |        100|                                               50|
      +-----------+-------------------------------------------------+
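      Until such an API is exposed, the same filtered aggregate can be reached from the DataFrame API by wrapping the SQL fragment in `expr`. A minimal sketch, assuming a running `SparkSession` named `spark` and that `expr` accepts the same FILTER syntax as the SQL parser:

      ```scala
      import org.apache.spark.sql.functions.{count, expr, lit}

      val df = spark.range(100)

      // The FILTER clause is parsed as a SQL string by expr(),
      // rather than being built with typed Column operators.
      df.agg(
        count(lit(1)).as("classic_cnt"),
        expr("count(1) FILTER (WHERE id < 50)").as("filtered_cnt")
      ).show()
      ```

      This works, but it forfeits compile-time checking of the filter expression, which is part of the motivation for a first-class API.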

      These aggregations are especially useful when aggregating over overlapping subsets of the data, where a pivot would not work because pivot buckets must be disjoint:

      SELECT 
       AVG(revenue) FILTER (WHERE age < 25),
       AVG(revenue) FILTER (WHERE age < 35),
       AVG(revenue) FILTER (WHERE age < 45)
      FROM people;
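      The overlapping-buckets query above hints at what an exposed API could look like. The shape below is purely hypothetical — a `.filter(...)` method on aggregate columns is an assumption for illustration, not an existing `spark.sql.functions` API — and assumes a `people` DataFrame with `revenue` and `age` columns:

      ```scala
      // Hypothetical API sketch: Column.filter on an aggregate does not exist today.
      people.agg(
        avg($"revenue").filter($"age" < 25),
        avg($"revenue").filter($"age" < 35),
        avg($"revenue").filter($"age" < 45)
      )
      ```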

      I did not find an existing issue tracking this, so I am creating this one and will attach a PR illustrating a possible implementation.


          People

            Assignee: Unassigned
            Reporter: Erwan Guyomarc'h (tools4origins)
            Votes: 0
            Watchers: 2
