Spark / SPARK-8029

ShuffleMapTasks must be robust to concurrent attempts on the same executor

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.5.3, 1.6.0
    • Component/s: Spark Core
    • Labels:
      None

      Description

      When stages get retried, a task may have more than one attempt running at the same time, on the same executor. Currently this causes problems for ShuffleMapTasks, since all attempts try to write to the same output files.

      This was ultimately resolved by https://github.com/apache/spark/pull/9610, which uses a first-writer-wins approach.
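
The first-writer-wins idea can be illustrated with a minimal sketch. This is not the code from the pull request; it only assumes that each attempt writes to its own temporary file and that the commit step is serialized per executor. The names `write_map_output`, `_commit_lock`, and the file name `shuffle_0_0.data` are hypothetical and chosen for illustration.

```python
import os
import tempfile
import threading

# Hypothetical per-executor lock that serializes the commit step.
_commit_lock = threading.Lock()


def write_map_output(final_path, attempt_id, data):
    """Write one attempt's output; return True if this attempt was committed."""
    # Each attempt writes to its own temporary file, so concurrent
    # attempts never write to the same file.
    tmp_path = f"{final_path}.{attempt_id}.tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)

    with _commit_lock:
        if os.path.exists(final_path):
            # Another attempt committed first: keep its output, drop ours.
            os.remove(tmp_path)
            return False
        # First writer: publish the temporary file under the final name.
        os.rename(tmp_path, final_path)
        return True


def demo(num_attempts=4):
    """Run several concurrent attempts of the same (hypothetical) map task."""
    with tempfile.TemporaryDirectory() as d:
        final_path = os.path.join(d, "shuffle_0_0.data")  # hypothetical name
        results = [False] * num_attempts

        def run(i):
            results[i] = write_map_output(final_path, i, f"attempt-{i}".encode())

        threads = [threading.Thread(target=run, args=(i,)) for i in range(num_attempts)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        with open(final_path, "rb") as f:
            content = f.read()
        leftovers = sorted(os.listdir(d))
        return results, content, leftovers
```

In this sketch exactly one attempt reports a successful commit, the final file holds that attempt's bytes in full, and the losing attempts clean up their temporary files. Which attempt wins is not deterministic.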

          Activity

          irashid Imran Rashid added a comment -

          Josh Rosen and I discussed this a bit; I'm uploading a doc with a discussion of some alternatives. I am exploring the options a little, but would appreciate any feedback on them.

          irashid Imran Rashid added a comment - - edited

          This is a subset of the issues originally reported in SPARK-7308, split out to give an issue with a smaller scope that is hopefully still large enough to consider the design.

          SPARK-7829 is the "ad-hoc" proposed fix for this issue.

          apachespark Apache Spark added a comment -

          User 'squito' has created a pull request for this issue:
          https://github.com/apache/spark/pull/6648

          rxin Reynold Xin added a comment -

          I have retargeted this and downgraded it from Blocker to Critical, since it has been there for a while and is not a regression.

          rxin Reynold Xin added a comment -

          It'd be really good to fix this in 1.6, and maybe even backport it to older branches.

          Imran Rashid, would you have time to give "Executors Commit ShuffleMapOutput: First Attempt Wins" from your design proposal a try? It seems to need a much smaller fix, and the chance of that fix having problems is pretty low (even though you consider it "optimistic").

          apachespark Apache Spark added a comment -

          User 'squito' has created a pull request for this issue:
          https://github.com/apache/spark/pull/9214

          apachespark Apache Spark added a comment -

          User 'davies' has created a pull request for this issue:
          https://github.com/apache/spark/pull/9610

          davies Davies Liu added a comment -

          Issue resolved by pull request 9610
          https://github.com/apache/spark/pull/9610

          apachespark Apache Spark added a comment -

          User 'davies' has created a pull request for this issue:
          https://github.com/apache/spark/pull/9686

          rxin Reynold Xin added a comment -

          Davies Liu, can you update the JIRA ticket description with the high-level approach used in the fix?


            People

            • Assignee: Davies Liu
            • Reporter: Imran Rashid
            • Votes: 2
            • Watchers: 11
