Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-110

Add support for multiple distinct operators in the same query block

    XMLWordPrintableJSON

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 0.5, Impala 1.4, Impala 2.0, Impala 2.2, Impala 2.3.0
    • Fix Version/s: Impala 3.1.0
    • Component/s: Backend, Frontend
    • Labels:

      Description

      Impala only allows a single (DISTINCT columns) expression in each query.

      Note:
      If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by specifying NDV(column); a query can contain multiple instances of NDV(column). To make Impala automatically rewrite COUNT(DISTINCT) expressions to NDV(), enable the APPX_COUNT_DISTINCT query option.

      [impala:21000] > select count(distinct i_class_id) from item;
      Query: select count(distinct i_class_id) from item
      Query finished, fetching results ...
      16
      Returned 1 row(s) in 1.51s
      
      [impala:21000] > select count(distinct i_class_id), count(distinct i_brand_id) from item;
      Query: select count(distinct i_class_id), count(distinct i_brand_id) from item
      ERROR: com.cloudera.impala.common.AnalysisException: Analysis exception (in select count(distinct i_class_id), count(distinct i_brand_id) from item)
      	at com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:133)
      	at com.cloudera.impala.service.Frontend.createExecRequest(Frontend.java:221)
      	at com.cloudera.impala.service.JniFrontend.createExecRequest(JniFrontend.java:89)
      Caused by: com.cloudera.impala.common.AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as COUNT(DISTINCT i_class_id); deviating function: COUNT(DISTINCT i_brand_id)
      	at com.cloudera.impala.analysis.AggregateInfo.createDistinctAggInfo(AggregateInfo.java:196)
      	at com.cloudera.impala.analysis.AggregateInfo.create(AggregateInfo.java:143)
      	at com.cloudera.impala.analysis.SelectStmt.createAggInfo(SelectStmt.java:466)
      	at com.cloudera.impala.analysis.SelectStmt.analyzeAggregation(SelectStmt.java:347)
      	at com.cloudera.impala.analysis.SelectStmt.analyze(SelectStmt.java:155)
      	at com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:130)
      	... 2 more
      

      Hive supports this:

      $ hive -e "select count(distinct i_class_id), count(distinct i_brand_id) from item;"
      Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
      Hive history file=/tmp/grahn/hive_job_log_grahn_201303052234_1625576708.txt
      Total MapReduce jobs = 1
      Launching Job 1 out of 1
      Number of reduce tasks determined at compile time: 1
      In order to change the average load for a reducer (in bytes):
        set hive.exec.reducers.bytes.per.reducer=<number>
      In order to limit the maximum number of reducers:
        set hive.exec.reducers.max=<number>
      In order to set a constant number of reducers:
        set mapred.reduce.tasks=<number>
      Starting Job = job_201302081514_0073, Tracking URL = http://impala:50030/jobdetails.jsp?jobid=job_201302081514_0073
      Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=m0525.mtv.cloudera.com:8021 -kill job_201302081514_0073
      Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
      2013-03-05 22:34:43,255 Stage-1 map = 0%,  reduce = 0%
      2013-03-05 22:34:49,323 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 sec
      2013-03-05 22:34:50,337 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 sec
      2013-03-05 22:34:51,351 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 sec
      2013-03-05 22:34:52,360 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 sec
      2013-03-05 22:34:53,370 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 sec
      2013-03-05 22:34:54,379 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 sec
      2013-03-05 22:34:55,389 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.58 sec
      2013-03-05 22:34:56,402 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.58 sec
      2013-03-05 22:34:57,413 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.58 sec
      2013-03-05 22:34:58,424 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.58 sec
      MapReduce Total cumulative CPU time: 8 seconds 580 msec
      Ended Job = job_201302081514_0073
      MapReduce Jobs Launched: 
      Job 0: Map: 1  Reduce: 1   Cumulative CPU: 8.58 sec   HDFS Read: 0 HDFS Write: 0 SUCCESS
      Total MapReduce CPU Time Spent: 8 seconds 580 msec
      OK
      16	952
      Time taken: 25.666 seconds
      

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                twmarshall Thomas Tauber-Marshall
                Reporter:
                grahn Greg Rahn
              • Votes:
                37 Vote for this issue
                Watchers:
                50 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: