Hadoop Map/Reduce / MAPREDUCE-6957

shuffle hangs after a node manager connection timeout

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-beta1, 2.7.5, 2.8.3
    • Component/s: mrv2
    • Labels:
      None

      Description

      After a connection failure from the reducer to the node manager, shuffles started to hang with the following message:

      org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 - MergeManager returned status WAIT ...
      

      There are two problems that lead to the hang.

      Problem 1.
      When a reducer has an issue connecting to the node manager, copyFromHost may call putBackKnownMapOutput on the same task attempt multiple times.

      There are two call sites of putBackKnownMapOutput in copyFromHost since MAPREDUCE-6303:
      1. In the finally block of copyFromHost
      2. In the catch block of openShuffleUrl.

      When openShuffleUrl fails to connect, its catch block calls putBackKnownMapOutput for all remaining map outputs and then returns null to copyFromHost.
      The finally block of copyFromHost then calls putBackKnownMapOutput a second time on the same map outputs.
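
The double put-back in Problem 1 can be illustrated with a small stand-alone model. This is only a sketch, not the Hadoop source: the class name, the list-based bookkeeping, the boolean return value and the putBackInCatch flag are all assumptions made for illustration. The methods borrow the names copyFromHost and openShuffleUrl purely to mirror the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the two put-back call sites; not the real Fetcher.
class DoublePutBackSketch {

    // Stand-in for openShuffleUrl when the connection fails. With
    // putBackInCatch set, it re-queues the remaining map outputs itself.
    // It always reports failure in this sketch (the real method returns null).
    static boolean openShuffleUrl(List<String> remaining, List<String> known,
                                  boolean putBackInCatch) {
        if (putBackInCatch) {
            known.addAll(remaining);   // call site 2: catch block
        }
        return false;
    }

    // Stand-in for copyFromHost. Returns the map outputs that were put
    // back to the scheduler for this host.
    static List<String> copyFromHost(List<String> maps, boolean putBackInCatch) {
        List<String> known = new ArrayList<>();
        List<String> remaining = new ArrayList<>(maps);
        try {
            if (openShuffleUrl(remaining, known, putBackInCatch)) {
                remaining.clear();     // pretend everything was fetched
            }
        } finally {
            known.addAll(remaining);   // call site 1: finally block
        }
        return known;
    }

    public static void main(String[] args) {
        List<String> maps = Arrays.asList("attempt_m_0", "attempt_m_1");
        // Each attempt is queued twice when both call sites are active.
        System.out.println("two call sites: " + copyFromHost(maps, true));
        // Each attempt is queued once when only the finally block puts back.
        System.out.println("one call site:  " + copyFromHost(maps, false));
    }
}
```

With both call sites active, every remaining attempt ends up queued twice, which is how several fetchers can later be handed the same map output.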

      Problem 2. Problem 1 causes a leak in MergeManager.
      The problem occurs when multiple fetchers get the same set of map attempt outputs to fetch.
      Different fetchers reserve memory from MergeManager in Fetcher.copyMapOutput for the same map outputs.
      When the fetch succeeds, only the first map output gets committed through ShuffleSchedulerImpl.copySucceeded -> InMemoryMapOutput.commit, because commit() is gated by !finishedMaps[mapIndex].
      This may lead to a condition where usedMemory > memoryLimit, while commitMemory < mergeThreshold.
      This gets the MergeManager into a deadlock where a merge is never triggered while MergeManager cannot reserve additional space for map outputs.
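
The accounting behind Problem 2 can be traced with a toy model. This is a simplified sketch under stated assumptions, not the real MergeManager: the class, the numbers (limit 100, threshold 90, outputs of size 30) and the reserve/copySucceeded signatures are hypothetical. Only the field names usedMemory, commitMemory, memoryLimit, mergeThreshold and finishedMaps come from the description above.

```java
// Hypothetical model of the memory accounting; not the real MergeManager.
class MergeManagerLeakSketch {
    final long memoryLimit;
    final long mergeThreshold;
    final boolean[] finishedMaps;
    long usedMemory = 0;
    long commitMemory = 0;

    MergeManagerLeakSketch(long memoryLimit, long mergeThreshold, int numMaps) {
        this.memoryLimit = memoryLimit;
        this.mergeThreshold = mergeThreshold;
        this.finishedMaps = new boolean[numMaps];
    }

    // Returns false to model the WAIT status seen in the log message.
    boolean reserve(long size) {
        if (usedMemory > memoryLimit) {
            return false;
        }
        usedMemory += size;
        return true;
    }

    // Only the first successful copy of a map is committed; a duplicate
    // keeps its reservation but never counts toward the merge threshold.
    void copySucceeded(int mapIndex, long size) {
        if (!finishedMaps[mapIndex]) {
            finishedMaps[mapIndex] = true;
            commitMemory += size;
        }
    }

    boolean isStuck() {
        return usedMemory > memoryLimit && commitMemory < mergeThreshold;
    }

    // Two fetchers each fetch the same two map outputs of size 30.
    static MergeManagerLeakSketch run() {
        MergeManagerLeakSketch m = new MergeManagerLeakSketch(100, 90, 3);
        for (int map = 0; map < 2; map++) {
            for (int fetcher = 0; fetcher < 2; fetcher++) {
                m.reserve(30);
            }
        }
        for (int map = 0; map < 2; map++) {
            for (int fetcher = 0; fetcher < 2; fetcher++) {
                m.copySucceeded(map, 30);
            }
        }
        return m;
    }

    public static void main(String[] args) {
        MergeManagerLeakSketch m = run();
        System.out.println("usedMemory=" + m.usedMemory
            + " commitMemory=" + m.commitMemory + " stuck=" + m.isStuck());
    }
}
```

In this run four reservations are made but only two are committed, so usedMemory ends above the limit while commitMemory stays below the threshold. The third map output can no longer be reserved, and no merge starts to free space.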

      1. MAPREDUCE-6957.001.patch
        3 kB
        Jooseong Kim
      2. MAPREDUCE-6957.002.patch
        9 kB
        Jooseong Kim
      3. MAPREDUCE-6957.003.patch
        9 kB
        Jooseong Kim

          Activity

          jooseong Jooseong Kim added a comment -

          The patch removes the call to putBackKnownMapOutput from openShuffleUrl and leaves only one call site in copyFromHost.

          daemeonr@gmail.com daemeon reiydelle added a comment -

          I always wondered what you were doing buried in the Java code every time I
          walked up. Thank you for your hard work! It has made supporting Hadoop at
          scale so much easier.


          jlowe Jason Lowe added a comment -

          Thanks for the report and the patch!

          When the fetch succeeds, only the first map output gets committed through ShuffleSchedulerImpl.copySucceeded -> InMemoryMapOutput.commit, because commit() is gated by !finishedMaps[mapIndex].

          This looks like another latent bug. If, for whatever reason, we try to report a fetch completed for a map that has already completed fetching then it should call output.abort() so we unreserve the memory. Even with the redundant fetching caused by the double put-back of known map outputs, that unreserve fix would have prevented the merge manager hang.

          Would you mind updating the patch to address the missing unreserve? The rest of the patch looks good to me.

          jooseong Jooseong Kim added a comment -

          Thank you for the review and for pointing out the other bug, Jason Lowe. I added a call to output.abort() for the case where a fetch is completing for a finished map. Please let me know what you think.
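
The abort-on-duplicate idea discussed in this exchange can be sketched as follows. This is an illustrative model under assumptions, not the committed patch: the class, the nested Output type and the memory counters are hypothetical. Only the names copySucceeded, commit, abort and finishedMaps come from the discussion above.

```java
// Hypothetical model of releasing memory for a duplicate fetch.
class AbortDuplicateSketch {
    final boolean[] finishedMaps;
    long usedMemory = 0;
    long commitMemory = 0;

    AbortDuplicateSketch(int numMaps) {
        this.finishedMaps = new boolean[numMaps];
    }

    // Stand-in for an in-memory map output; creating one reserves memory.
    final class Output {
        final long size;

        Output(long size) {
            this.size = size;
            usedMemory += size;
        }

        void commit() {
            commitMemory += size;
        }

        void abort() {
            usedMemory -= size;   // give the reservation back
        }
    }

    // The first copy of a map is committed; a later copy of the same map
    // is aborted so its reservation does not leak.
    void copySucceeded(int mapIndex, Output output) {
        if (!finishedMaps[mapIndex]) {
            output.commit();
            finishedMaps[mapIndex] = true;
        } else {
            output.abort();
        }
    }

    // Two fetchers complete the same map output of size 30.
    static AbortDuplicateSketch run() {
        AbortDuplicateSketch s = new AbortDuplicateSketch(1);
        Output first = s.new Output(30);
        Output duplicate = s.new Output(30);
        s.copySucceeded(0, first);
        s.copySucceeded(0, duplicate);
        return s;
    }

    public static void main(String[] args) {
        AbortDuplicateSketch s = run();
        System.out.println("usedMemory=" + s.usedMemory
            + " commitMemory=" + s.commitMemory);
    }
}
```

After the duplicate is aborted, the reserved and committed totals agree again, so a redundant fetch alone no longer pushes usedMemory past the limit.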

          jlowe Jason Lowe added a comment -

          Patch looks good to me. Please move this into the "Patch Available" state so the QA bot can comment on the patch.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 1m 13s Docker mode activated.
                Prechecks
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
                trunk Compile Tests
          +1 mvninstall 19m 19s trunk passed
          +1 compile 0m 43s trunk passed
          +1 checkstyle 0m 31s trunk passed
          +1 mvnsite 0m 40s trunk passed
          +1 findbugs 1m 10s trunk passed
          +1 javadoc 0m 29s trunk passed
                Patch Compile Tests
          +1 mvninstall 0m 32s the patch passed
          +1 compile 0m 32s the patch passed
          +1 javac 0m 32s the patch passed
          -1 checkstyle 0m 24s hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core: The patch generated 5 new + 209 unchanged - 0 fixed = 214 total (was 209)
          +1 mvnsite 0m 33s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 8s the patch passed
          +1 javadoc 0m 26s the patch passed
                Other Tests
          +1 unit 3m 6s hadoop-mapreduce-client-core in the patch passed.
          +1 asflicense 0m 24s The patch does not generate ASF License warnings.
          Total runtime: 32m 3s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:71bbb86
          JIRA Issue MAPREDUCE-6957
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12886667/MAPREDUCE-6957.002.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 541b281c9f1d 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / f1d751b
          Default Java 1.8.0_144
          findbugs v3.1.0-RC1
          checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7131/artifact/patchprocess/diff-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core.txt
          Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7131/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7131/console
          Powered by Apache Yetus 0.5.0 http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          It would be good to fix the naming convention checkstyle issue which I missed in my review, as well as the line-length nit since everything else is formatted to fit. I'm OK with the whitespace nits if you'd rather not fix those since I'm not sure it helps readability in this particular case.

          jooseong Jooseong Kim added a comment -

          Fixed the checkstyle errors. I missed them before; sorry about that.

          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 20m 27s Docker mode activated.
                Prechecks
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
                trunk Compile Tests
          +1 mvninstall 15m 48s trunk passed
          +1 compile 0m 30s trunk passed
          +1 checkstyle 0m 20s trunk passed
          +1 mvnsite 0m 32s trunk passed
          +1 findbugs 0m 55s trunk passed
          +1 javadoc 0m 23s trunk passed
                Patch Compile Tests
          +1 mvninstall 0m 26s the patch passed
          +1 compile 0m 26s the patch passed
          +1 javac 0m 26s the patch passed
          +1 checkstyle 0m 18s the patch passed
          +1 mvnsite 0m 29s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 5s the patch passed
          +1 javadoc 0m 21s the patch passed
                Other Tests
          +1 unit 3m 0s hadoop-mapreduce-client-core in the patch passed.
          +1 asflicense 0m 14s The patch does not generate ASF License warnings.
          Total runtime: 45m 57s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:71bbb86
          JIRA Issue MAPREDUCE-6957
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12886881/MAPREDUCE-6957.003.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 3b92538772e6 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / fa6cc43
          Default Java 1.8.0_144
          findbugs v3.1.0-RC1
          Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7132/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7132/console
          Powered by Apache Yetus 0.5.0 http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Thanks for updating the patch!

          +1 lgtm. Committing this.

          jlowe Jason Lowe added a comment -

          Thanks, Jooseong Kim! I committed this to trunk, branch-3.0, branch-2, branch-2.8, and branch-2.7.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12865 (See https://builds.apache.org/job/Hadoop-trunk-Commit/12865/)
          MAPREDUCE-6957. shuffle hangs after a node manager connection timeout. (jlowe: rev 4d98936eec1b5d196053426c70d455cf8f83f84f)

          • (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/ShuffleSchedulerImpl.java
          • (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestShuffleScheduler.java
          • (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          • (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          jooseong Jooseong Kim added a comment -

          Thank you for the quick review!


            People

            • Assignee:
              jooseong Jooseong Kim
              Reporter:
              jooseong Jooseong Kim