Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-19355

Use map output statistices to improve global limit's parallelism

Log workAgile BoardRank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsAdd voteVotersStop watchingWatchersCreate sub-taskConvert to sub-taskLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Improvement
    • Status: In Progress
    • Major
    • Resolution: Unresolved
    • None
    • None
    • SQL
    • None

    Description

      A logical Limit is performed actually by two physical operations LocalLimit and GlobalLimit.

      In most of time, before GlobalLimit, we will perform a shuffle exchange to shuffle data to single partition. When the limit number is very big, we shuffle a lot of data to a single partition and significantly reduce parallelism, except for the cost of shuffling.

      This change tries to perform GlobalLimit without shuffling data to single partition. Instead, we perform the map stage of the shuffling and collect the statistics of the number of rows in each partition. Shuffled data are actually all retrieved locally without from remote executors.

      Once we get the number of output rows in each partition, we only take the required number of rows from the locally shuffled data.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            viirya L. C. Hsieh Assign to me
            viirya L. C. Hsieh

            Dates

              Created:
              Updated:

              Slack

                Issue deployment