SPARK-6006
Optimize count distinct in case of high cardinality columns


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.1.1, 1.2.1
    • Fix Version/s: 1.6.0
    • Component/s: SQL
    • Labels: None

    Description

      When a column has many distinct values, count distinct becomes slow because all partial results are hashed into a single map. This can be improved by adding an intermediate stage that builds buckets (partial maps), so that the same key coming from the first stage's partial maps always hashes to the same bucket. Summing the sizes of these buckets then yields the total distinct count.
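
      As a minimal sketch of the approach (illustrative only, not the patch that resolved this issue; the sample data, bucket count, and object name are assumptions), the two stages can be expressed over RDDs in Scala:

      import org.apache.spark.{SparkConf, SparkContext}
      import scala.collection.mutable

      object BucketedCountDistinct {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(
            new SparkConf().setAppName("bucketed-count-distinct").setMaster("local[*]"))

          // Hypothetical high-cardinality input: ~3M distinct values.
          val values = sc.parallelize(1L to 10000000L, 16).map(_ % 3000000L)
          val numBuckets = 64 // illustrative; tune to cardinality and cluster size

          // Stage 1: partial maps -- deduplicate within each partition.
          val partial = values.mapPartitions { it =>
            val seen = mutable.HashSet.empty[Long]
            it.foreach(seen += _)
            seen.iterator
          }

          // Intermediate stage: hash each surviving value to a bucket and merge
          // per-bucket sets. Identical values from different partial maps always
          // meet in the same bucket, so no single map ever holds every value.
          val total = partial
            .map(v => ((v.hashCode & Int.MaxValue) % numBuckets, v))
            .aggregateByKey(mutable.HashSet.empty[Long], numBuckets)(
              (set, v) => { set += v; set },
              (a, b) => { a ++= b; a })
            .map { case (_, set) => set.size.toLong }
            .reduce(_ + _) // sum of bucket sizes = total distinct count

          println(s"distinct count = $total")
          sc.stop()
        }
      }

      Each bucket only ever sees the values that hash to it, so peak memory on any single task is roughly 1/numBuckets of the naive single-map approach.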

People

    • Assignee: davies (Davies Liu)
    • Reporter: saucam (Yash Datta)
    • Votes: 0
    • Watchers: 5
