  Spark / SPARK-30212

COUNT(DISTINCT) window function should be supported


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None
    • Environment:

      Spark 2.4.4

      Scala 2.11.12

      Hive 2.3.6

      Description

      Suppose we have a typical table in Hive like the one below:

      CREATE TABLE DEMO_COUNT_DISTINCT (
      demo_date string,
      demo_id string
      );
      
      +--------------------------------+------------------------------+
      | demo_count_distinct.demo_date  | demo_count_distinct.demo_id  |
      +--------------------------------+------------------------------+
      | 20180301                       | 101                          |
      | 20180301                       | 102                          |
      | 20180301                       | 103                          |
      | 20180401                       | 201                          |
      | 20180401                       | 202                          |
      +--------------------------------+------------------------------+
      

      Now I want to count the distinct number of DEMO_DATE values while retaining every column's data in each row.
      So I use the COUNT(DISTINCT) window function (which is also common in other mainstream databases such as Oracle) in Hive Beeline, and it works:

      SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
       FROM DEMO_COUNT_DISTINCT T;
      
      +--------------+------------+-------------+
      | t.demo_date  | t.demo_id  | uniq_dates  |
      +--------------+------------+-------------+
      | 20180401     | 202        | 2           |
      | 20180401     | 201        | 2           |
      | 20180301     | 103        | 2           |
      | 20180301     | 102        | 2           |
      | 20180301     | 101        | 2           |
      +--------------+------------+-------------+
      

      But when I came to Spark SQL, it threw an exception even though I ran the same SQL.

      spark.sql("""
      SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
       FROM DEMO_COUNT_DISTINCT T
      """).show
      
      org.apache.spark.sql.AnalysisException: Distinct window functions are not supported: count(distinct DEMO_DATE#1) windowspecdefinition(null, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$()));;
      Project [demo_date#1, demo_id#2, UNIQ_DATES#0L]
      +- Project [demo_date#1, demo_id#2, UNIQ_DATES#0L, UNIQ_DATES#0L]
         +- Window [count(distinct DEMO_DATE#1) windowspecdefinition(null, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS UNIQ_DATES#0L], [null]
            +- Project [demo_date#1, demo_id#2]
               +- SubqueryAlias `T`
                  +- SubqueryAlias `default`.`demo_count_distinct`
                     +- HiveTableRelation `default`.`demo_count_distinct`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [demo_date#1, demo_id#2]
      
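      Until DISTINCT is supported inside window functions, one portable workaround is the double-DENSE_RANK trick: for any value, its DENSE_RANK ascending plus its DENSE_RANK descending minus 1 equals the number of distinct values in the partition. The sketch below is an assumption-laden demonstration, not Spark itself: it uses Python's stdlib sqlite3 only to execute the SQL, on the premise that the same query (which relies solely on standard window functions Spark SQL also provides) would run unchanged under spark.sql.

      ```python
      # Hedged sketch: SQLite stands in for Spark SQL here (assumption).
      # COUNT(DISTINCT x) OVER () is emulated as:
      #   DENSE_RANK() OVER (ORDER BY x ASC) + DENSE_RANK() OVER (ORDER BY x DESC) - 1
      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.execute("CREATE TABLE demo_count_distinct (demo_date TEXT, demo_id TEXT)")
      conn.executemany(
          "INSERT INTO demo_count_distinct VALUES (?, ?)",
          [("20180301", "101"), ("20180301", "102"), ("20180301", "103"),
           ("20180401", "201"), ("20180401", "202")],
      )

      rows = conn.execute("""
          SELECT T.*,
                 DENSE_RANK() OVER (ORDER BY T.DEMO_DATE ASC)
               + DENSE_RANK() OVER (ORDER BY T.DEMO_DATE DESC)
               - 1 AS UNIQ_DATES
            FROM DEMO_COUNT_DISTINCT T
      """).fetchall()

      for row in rows:
          print(row)  # every row carries UNIQ_DATES = 2
      ```

      For a real partition, the same PARTITION BY clause would go inside both OVER() specs.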

      Then I tried the countDistinct function, but that also failed:

      spark.sql("""
      SELECT T.*, countDistinct(T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
       FROM DEMO_COUNT_DISTINCT T
      """).show
      
      org.apache.spark.sql.AnalysisException: Undefined function: 'countDistinct'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 12
       at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279)
       at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279)
       at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
       ......
      
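      The second error is unsurprising, since countDistinct exists only in the DataFrame API (org.apache.spark.sql.functions.countDistinct), not as a SQL identifier. Another workaround that keeps every row's columns is to compute the distinct count once in a subquery and cross-join it back. As above, this is a hedged sketch: stdlib sqlite3 merely executes the SQL, on the assumption that this plain ANSI query would behave the same in Spark SQL.

      ```python
      # Hedged sketch (SQLite as a stand-in for Spark SQL, by assumption):
      # compute COUNT(DISTINCT demo_date) once, then CROSS JOIN it onto every row.
      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.execute("CREATE TABLE demo_count_distinct (demo_date TEXT, demo_id TEXT)")
      conn.executemany(
          "INSERT INTO demo_count_distinct VALUES (?, ?)",
          [("20180301", "101"), ("20180301", "102"), ("20180301", "103"),
           ("20180401", "201"), ("20180401", "202")],
      )

      rows = conn.execute("""
          SELECT T.*, C.UNIQ_DATES
            FROM DEMO_COUNT_DISTINCT T
           CROSS JOIN (SELECT COUNT(DISTINCT DEMO_DATE) AS UNIQ_DATES
                         FROM DEMO_COUNT_DISTINCT) C
      """).fetchall()

      for row in rows:
          print(row)  # every row carries UNIQ_DATES = 2
      ```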

            People

            • Assignee: Unassigned
            • Reporter: kernelforce Kernel Force
            • Votes: 0
            • Watchers: 3
