[SPARK-37099] Introduce a rank-based filter to optimize top-k computation - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.5.0
Component/s: SQL
Labels: None

Description

At JD, we found that more than 90% of window function usage follows this pattern:

 select (... (row_number|rank|dense_rank) () over( [partition by ...] order by ... ) as rn)
    where rn (==|<|<=) k and other conditions

However, the existing physical plan is not optimal:

1. We should first select the local top-k records within each partition, and then compute the global top-k. This reduces the amount of data shuffled.

For these three rank functions (row_number|rank|dense_rank), the rank of a key computed on a partial dataset is always <= its final rank computed on the whole dataset, so we can safely discard rows with partial rank > k at any stage.

2. Skewed windows: some partitions are skewed and take a long time to finish computation.

A real-world skewed-window case from our system is attached.
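
The pruning argument in point 1 can be checked with a small simulation. The sketch below is plain Python, not Spark code: the helper names (`rank_rows`, `top_k`, `top_k_with_local_filter`) and the sample data are hypothetical, a single window partition key is assumed, and each inner list stands for the rows of that key held by one input split before the shuffle. It only illustrates that filtering on the partial rank first gives the same top-k values as ranking the whole dataset.

```python
def rank_rows(values, func):
    """Rank values in ascending order; returns (value, rank) pairs."""
    ordered = sorted(values)
    ranked = []
    dense = 0        # number of distinct values seen so far
    first_pos = 0    # position of the first row in the current tie group
    for pos, v in enumerate(ordered, start=1):
        if pos == 1 or v != ordered[pos - 2]:
            dense += 1
            first_pos = pos
        if func == "row_number":
            r = pos
        elif func == "rank":
            r = first_pos
        elif func == "dense_rank":
            r = dense
        else:
            raise ValueError(f"unsupported function: {func}")
        ranked.append((v, r))
    return ranked


def top_k(values, func, k):
    """Keep the values whose rank is <= k."""
    return [v for v, r in rank_rows(values, func) if r <= k]


def top_k_with_local_filter(chunks, func, k):
    """Filter each chunk by its partial rank, then rank the survivors."""
    survivors = []
    for chunk in chunks:
        # A row's partial rank never exceeds its final rank, so rows
        # dropped here could not have passed the final filter either.
        survivors.extend(top_k(chunk, func, k))
    return top_k(survivors, func, k)


# Hypothetical data: three input splits holding rows of the same key.
chunks = [[5, 1, 9, 1], [3, 3, 7], [2, 8, 1]]
all_rows = [v for chunk in chunks for v in chunk]

for func in ("row_number", "rank", "dense_rank"):
    direct = top_k(all_rows, func, 3)
    pruned = top_k_with_local_filter(chunks, func, 3)
    print(func, direct == pruned)
```

Only values are compared here. With `row_number`, which of several tied rows gets kept is not deterministic, with or without the local filter.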

Attachments

- skewed_window.png (157 kB), 22/Oct/21 10:48, Ruifeng Zheng
- q67.png (789 kB), 31/Dec/21 11:39, Ruifeng Zheng
- q67_optimized.png (829 kB), 31/Dec/21 11:39, Ruifeng Zheng

Issue Links

links to

[Github] Pull Request #34367 (zhengruifeng)

[Github] Pull Request #38745 (beliefer)

[Github] Pull Request #38799 (beliefer)

[Github] Pull Request #39930 (beliefer)

[Github] Pull Request #40754 (ulysses-you)

People

Assignee: Jiaan Geng

Reporter: Ruifeng Zheng

Votes: 0

Watchers: 6

Dates

Created: 22/Oct/21 10:47

Updated: 13/Apr/23 03:47

Resolved: 21/Feb/23 07:09