Spark / SPARK-48013

LIMIT can be improved


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      Currently, LIMIT runs with a single task in the GlobalLimit operator.  Many users work around the problem by using SAMPLE instead, as in: https://towardsdatascience.com/stop-using-the-limit-clause-wrong-with-spark-646e328774f5


      Spark can be improved to:
      1. Do a per-partition count;

      2. Include K full partitions, plus 1 partial partition.

      While the 1 partial partition still requires a single task to run, the K full partitions would have dramatically reduced the N (of the LIMIT N), so it is much faster.
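      The partition-selection step described above can be sketched as follows. This is an illustrative outline only; the function name and inputs are hypothetical, and it is not Spark's actual implementation:

```python
def plan_limit(partition_counts, n):
    """Given per-partition row counts and LIMIT N, return (K, remainder):
    K leading partitions that can be taken whole, and the number of rows
    still needed from the next (partial) partition."""
    full = 0
    remaining = n
    for count in partition_counts:
        if count <= remaining:
            # This partition fits entirely under the limit; take it whole.
            remaining -= count
            full += 1
        else:
            break
    return full, remaining

# Example: LIMIT 2500 over partitions of sizes [1000, 1000, 1000, 1000]
# -> take 2 full partitions, then only 500 rows from the third.
print(plan_limit([1000, 1000, 1000, 1000], 2500))  # (2, 500)
```

      Only the `remainder` rows would need the single-task collection that GlobalLimit does today, which is the source of the speedup claimed above.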

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: zshao Zheng Shao
            Votes: 0
            Watchers: 1
