[SPARK-36289] rewrite distinct count case when expressions without Expand node - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: In Progress
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.2.0
Fix Version/s: None
Component/s: SQL
Labels:
None

Description

Currently, RewriteDistinctAggregates rule will rewrite distinct aggregates with extra Expand node. This causes unnecessary memory waste and performance fallback for some specific aggregates(e.g. count distinct case when).

We can rewrite count distinct case when aggregates without Expand:

for query:

{{ { * SELECT * cat1, * COUNT(DISTINCT CASE WHEN cond1 THEN cat2 ELSE null end) as cat2_cnt1, * COUNT(DISTINCT CASE WHEN cond2 THEN cat2 ELSE null end) as cat2_cnt2, * FROM * data * GROUP BY * key * }
}}

we should rewrite to :

{{ { * Aggregate( * key = ['key] * functions = [count(if (01 & 'bit_vector != 0) 0 else null), * count(if (10 & 'bit_vector != 0) 0 else null)] * output = ['key, 'cat2_cnt1, 'cat2_cnt2]) * Aggregate( * key = ['key, 'cat1] * functions = [bit_or(if (cond1) 01 else 00, if (cond2) 10 else 00)] * output = ['key, 'cat1, 'bit_vector]) * LocalTableScan [...] * }
}}

This method will improve performance and reduce memory waste

Attachments

Issue Links

links to

[Github] Pull Request #33520 (Gabriel39)

Activity

People

Assignee:: Unassigned

Reporter:: Wenqiang Lee

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 26/Jul/21 12:51

Updated:: 26/Jul/21 13:15