Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
3.5.1
None
None
Description
Currently, LIMIT runs as a single task in the GlobalLimit operator. Many users work around the problem by using SAMPLE, as described in: https://towardsdatascience.com/stop-using-the-limit-clause-wrong-with-spark-646e328774f5
Spark can be improved to:
1. Do a per-partition count;
2. Include K full partitions, plus 1 partial partition.
While the 1 partial partition still requires a single task to run, the K full partitions can be returned whole, so that task only has to produce the rows remaining after them instead of all N (of the LIMIT N). This should make the query a lot faster.
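
The two steps above can be sketched in plain Python. This is only an illustration of the selection logic, not Spark code: the function name plan_limit and its arguments are hypothetical, and it assumes the per-partition row counts from step 1 are already available as a list.

```python
from typing import List, Optional, Tuple


def plan_limit(
    partition_counts: List[int], n: int
) -> Tuple[List[int], Optional[Tuple[int, int]]]:
    """Pick K full partitions plus at most one partial partition for LIMIT n.

    partition_counts[i] is the row count of partition i (step 1).
    Returns (full, partial): `full` lists the partitions taken whole;
    `partial` is (partition_index, rows_to_take), or None if not needed.
    """
    full: List[int] = []
    remaining = n
    for idx, count in enumerate(partition_counts):
        if remaining <= 0:
            break  # the limit is already satisfied
        if count == 0:
            continue  # empty partitions contribute nothing
        if count <= remaining:
            # The whole partition fits under the limit: take it in full.
            full.append(idx)
            remaining -= count
        else:
            # Only part of this partition is needed; this is the one
            # partition that still needs a row-limited task.
            return full, (idx, remaining)
    return full, None


if __name__ == "__main__":
    print(plan_limit([4, 3, 5, 6], 10))
```

With counts [4, 3, 5, 6] and LIMIT 10, partitions 0 and 1 are taken whole (7 rows), and only 3 rows are needed from partition 2, so the single-task part handles 3 rows rather than 10.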