[HIVE-10568] Select count(distinct()) can have more optimal execution plan - ASF JIRA

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.6.0, 0.7.0, 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 1.0.0, 1.1.0
Fix Version/s: 1.2.0
Component/s: CBO, Logical Optimizer
Labels: None

select count(distinct ss_ticket_number) from store_sales;

can be rewritten as

select count(1) from (select distinct ss_ticket_number from store_sales) a;

which may run up to 3x faster.
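The two query forms above can be compared on a toy table. The following is a minimal sketch, not Hive itself: it assumes SQLite (through Python's standard sqlite3 module) as a stand-in engine, and the helper name compare_count_distinct is hypothetical. It says nothing about the speedup, only about result equivalence. One caveat it illustrates: count(distinct col) ignores NULLs, while select distinct keeps a single NULL row that count(1) then counts, so on a nullable column the outer aggregate would likely need to be count(col) instead.

```python
import sqlite3


def compare_count_distinct(values):
    """Run the original and rewritten queries on an in-memory table.

    Returns a tuple (direct, rewritten, null_safe):
      direct    -- count(DISTINCT col), the original query
      rewritten -- count(1) over SELECT DISTINCT col, the rewrite in this issue
      null_safe -- count(col) over SELECT DISTINCT col (assumed variant
                   for nullable columns, not taken from the issue)
    """
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Hive
    conn.execute("CREATE TABLE store_sales (ss_ticket_number INTEGER)")
    conn.executemany(
        "INSERT INTO store_sales VALUES (?)", [(v,) for v in values]
    )

    direct = conn.execute(
        "SELECT count(DISTINCT ss_ticket_number) FROM store_sales"
    ).fetchone()[0]

    rewritten = conn.execute(
        "SELECT count(1) FROM "
        "(SELECT DISTINCT ss_ticket_number FROM store_sales) a"
    ).fetchone()[0]

    null_safe = conn.execute(
        "SELECT count(ss_ticket_number) FROM "
        "(SELECT DISTINCT ss_ticket_number FROM store_sales) a"
    ).fetchone()[0]

    conn.close()
    return direct, rewritten, null_safe


if __name__ == "__main__":
    # No NULLs: all three forms agree.
    print(compare_count_distinct([1, 1, 2, 3, 3]))
    # With a NULL: the count(1) rewrite counts the NULL group as well.
    print(compare_count_distinct([1, 1, 2, None]))
```

When the column has no NULLs, all three counts match, which is the case the rewrite in this issue targets. The intuition for the speedup is that the inner distinct can be spread across many reducers, leaving only a cheap final count.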

is blocked by

HIVE-10607 Combination of ReducesinkDedup + TopN optimization yields incorrect result if there are multiple GBY in reducer

is related to

HIVE-10855 Make HIVE-10568 work with Spark [Spark Branch]

HIVE-19283 Select count(distinct()) a couple of times stuck in last reducer

links to

RB request