Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-2045

Sort-based shuffle implementation

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.1.0
    • Shuffle, Spark Core
    • None

    Description

      Building on the pluggability in SPARK-2044, a sort-based shuffle implementation that takes advantage of an Ordering for keys (or just sorts by hashcode for keys that don't have it) would likely improve performance and memory usage in very large shuffles. Our current hash-based shuffle needs an open file for each reduce task, which can fill up a lot of memory for compression buffers and cause inefficient IO. This would avoid both of those issues.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            matei Matei Alexandru Zaharia
            matei Matei Alexandru Zaharia
            Votes:
            0 Vote for this issue
            Watchers:
            23 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment