[SPARK-48013] LIMIT can be improved - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.5.1
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

Currently, LIMIT runs as a single task in the GlobalLimit operator. Many users work around the problem by using SAMPLE instead, as described in: https://towardsdatascience.com/stop-using-the-limit-clause-wrong-with-spark-646e328774f5

Spark can be improved to:
1. Do a per-partition row count;
2. Include K full partitions, plus 1 partial partition.

The 1 partial partition still requires a single task to run, but the K full partitions already supply most of the N rows (of the LIMIT N), so that task only has to produce the small remainder and the query should be a lot faster.
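
To illustrate the selection step, here is a minimal sketch in plain Python (not Spark code). It assumes the per-partition row counts from step 1 are already available as a list; the function name `plan_limit` and its return shape are hypothetical, chosen only for this example. Since LIMIT without ORDER BY does not promise which rows are returned, the sketch simply walks partitions in index order.

```python
from typing import List, Optional, Tuple


def plan_limit(
    partition_counts: List[int], n: int
) -> Tuple[List[int], Optional[Tuple[int, int]]]:
    """Pick K full partitions plus at most 1 partial partition for LIMIT n.

    Returns (full_partitions, partial), where full_partitions is a list of
    partition indices to take entirely, and partial is either None or a
    (partition_index, rows_to_take) pair.
    """
    full_partitions: List[int] = []
    remaining = n
    for idx, count in enumerate(partition_counts):
        if remaining <= 0:
            break  # LIMIT already satisfied by full partitions
        if count <= remaining:
            if count > 0:  # empty partitions contribute nothing
                full_partitions.append(idx)
            remaining -= count
        else:
            # This partition has more rows than still needed:
            # only the first `remaining` rows are required from it.
            return full_partitions, (idx, remaining)
    return full_partitions, None


if __name__ == "__main__":
    # Four partitions of 100 rows each, LIMIT 250:
    # partitions 0 and 1 are taken in full, 50 rows come from partition 2.
    print(plan_limit([100, 100, 100, 100], 250))
```

In this sketch only the partial partition needs a row-level limit, and that limit is the remainder (50 in the example) rather than the full N.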

People

Assignee: Unassigned

Reporter: Zheng Shao

Votes: 0

Watchers: 1

Dates

Created: 26/Apr/24 19:45

Updated: 26/Apr/24 19:45