[IMPALA-110] Add support for multiple distinct operators in the same query block - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 0.5, Impala 1.4, Impala 2.0, Impala 2.2, Impala 2.3.0
Fix Version/s: Impala 3.1.0
Component/s: Backend, Frontend
Labels:
- sql-language

Target Version: Product Backlog

Description

Impala allows only a single (DISTINCT columns) expression in each query block: all DISTINCT aggregate functions in the block must take the same set of parameters.

Note:
If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by specifying NDV(column); a query can contain multiple instances of NDV(column). To make Impala automatically rewrite COUNT(DISTINCT) expressions to NDV(), enable the APPX_COUNT_DISTINCT query option.
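
As a minimal sketch of the approximate approach described in the note, assuming the same item table used in the transcripts in this description. The results are estimates, so no output is shown here.

```sql
-- Several NDV() calls can appear in one query block; each returns an estimate.
select ndv(i_class_id), ndv(i_brand_id) from item;

-- Alternatively, let Impala rewrite COUNT(DISTINCT) to NDV() automatically
-- for the rest of the session.
set APPX_COUNT_DISTINCT=true;
select count(distinct i_class_id), count(distinct i_brand_id) from item;
```

With the option enabled, the second query is answered approximately rather than exactly, so it should not be used where the precise counts matter.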

[impala:21000] > select count(distinct i_class_id) from item;
Query: select count(distinct i_class_id) from item
Query finished, fetching results ...
16
Returned 1 row(s) in 1.51s

[impala:21000] > select count(distinct i_class_id), count(distinct i_brand_id) from item;
Query: select count(distinct i_class_id), count(distinct i_brand_id) from item
ERROR: com.cloudera.impala.common.AnalysisException: Analysis exception (in select count(distinct i_class_id), count(distinct i_brand_id) from item)
	at com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:133)
	at com.cloudera.impala.service.Frontend.createExecRequest(Frontend.java:221)
	at com.cloudera.impala.service.JniFrontend.createExecRequest(JniFrontend.java:89)
Caused by: com.cloudera.impala.common.AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as COUNT(DISTINCT i_class_id); deviating function: COUNT(DISTINCT i_brand_id)
	at com.cloudera.impala.analysis.AggregateInfo.createDistinctAggInfo(AggregateInfo.java:196)
	at com.cloudera.impala.analysis.AggregateInfo.create(AggregateInfo.java:143)
	at com.cloudera.impala.analysis.SelectStmt.createAggInfo(SelectStmt.java:466)
	at com.cloudera.impala.analysis.SelectStmt.analyzeAggregation(SelectStmt.java:347)
	at com.cloudera.impala.analysis.SelectStmt.analyze(SelectStmt.java:155)
	at com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:130)
	... 2 more
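
On versions that reject the query above, exact counts can still be obtained by computing each COUNT(DISTINCT) in its own subquery and combining the single-row results. The following is a sketch only, assuming the same item table and a release that supports CROSS JOIN; the aliases c, b, class_ids and brand_ids are made up for illustration.

```sql
-- Each inline view contains exactly one DISTINCT aggregate, so each is a
-- separate query block and passes analysis on its own.
select c.class_ids, b.brand_ids
from (select count(distinct i_class_id) as class_ids from item) c
cross join
     (select count(distinct i_brand_id) as brand_ids from item) b;
```

Each inline view scans item separately, so this costs one table scan per distinct column.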

Hive supports this:

$ hive -e "select count(distinct i_class_id), count(distinct i_brand_id) from item;"
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Hive history file=/tmp/grahn/hive_job_log_grahn_201303052234_1625576708.txt
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201302081514_0073, Tracking URL = http://impala:50030/jobdetails.jsp?jobid=job_201302081514_0073
Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=m0525.mtv.cloudera.com:8021 -kill job_201302081514_0073
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-03-05 22:34:43,255 Stage-1 map = 0%,  reduce = 0%
2013-03-05 22:34:49,323 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 sec
2013-03-05 22:34:50,337 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 sec
2013-03-05 22:34:51,351 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 sec
2013-03-05 22:34:52,360 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 sec
2013-03-05 22:34:53,370 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 sec
2013-03-05 22:34:54,379 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 sec
2013-03-05 22:34:55,389 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.58 sec
2013-03-05 22:34:56,402 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.58 sec
2013-03-05 22:34:57,413 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.58 sec
2013-03-05 22:34:58,424 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.58 sec
MapReduce Total cumulative CPU time: 8 seconds 580 msec
Ended Job = job_201302081514_0073
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 8.58 sec   HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 580 msec
OK
16	952
Time taken: 25.666 seconds

Issue Links

breaks
- IMPALA-8790 IllegalStateException: Illegal reference to non-materialized slot (Resolved)

causes
- IMPALA-7240 TestSpillingDebugActionDimensions.test_spilling_regression_exhaustive hits memory limit on exhaustive tests (Resolved)
- IMPALA-7251 Fix QueryMaintenance calls in Aggregators (Resolved)
- IMPALA-7397 dcheck in impala::AggregationNode::Close (Resolved)

is depended upon by
- IMPALA-7204 Add support for GROUP BY ROLLUP, CUBE and GROUPING SETS (Open)

is related to
- IMPALA-10445 The ability to adjust NDV's precision with query option (Resolved)

Sub-Tasks

1. Impala 3.1 Doc: Add support for multiple distinct operators in the same query block (Closed, Alexandra Rodoni)
