Hadoop Map/Reduce · MAPREDUCE-6689

MapReduce job can infinitely increase number of reducer resource requests

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.3, 2.6.5, 3.0.0-alpha1
    • Component/s: None
    • Labels: None

      Description

      We have seen this issue on one of our clusters: when running a Terasort MapReduce job, some mappers failed after reducers had started, and the MR AM then tried to preempt reducers in order to schedule the failed mappers.

      After that, the MR AM enters an infinite loop: on every RMContainerAllocator#heartbeat run, it:

      • In preemptReducesIfNeeded, cancels all scheduled reducer requests (total scheduled reducers = 1024).
      • Then, in scheduleReduces, ramps all reducers back up (total = 1024).

      As a result, the total number of requested containers increases by 1024 on every MRAM-RM heartbeat (one heartbeat per second). The AM had been hanging for 18+ hours, so we ended up with 18 * 3600 * 1024 ~ 66M+ requested containers on the RM side.
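
      To make the arithmetic concrete, here is a minimal, self-contained simulation (illustrative only, not RMContainerAllocator code) of how the outstanding request count grows when every heartbeat cancels the scheduled reducers without decrementing the remote request table and then ramps them all back up:

        // Illustrative simulation only -- not Hadoop code. It shows why outstanding
        // reducer requests at the RM grow by ~1024 per heartbeat when the AM cancels
        // scheduled reducers without decrementing the ask already sent to the RM and
        // then ramps them all back up on the same heartbeat.
        public class RequestGrowthSketch {
          public static void main(String[] args) {
            final int scheduledReducers = 1024;   // reducers the job wants to schedule
            final int heartbeats = 18 * 3600;     // ~18 hours at one heartbeat per second
            long requestsSeenByRm = 0;

            for (int hb = 0; hb < heartbeats; hb++) {
              // preemptReducesIfNeeded(): cancels all scheduled reducer requests, but
              // the ask at the RM is never decremented, so nothing is subtracted.
              // scheduleReduces(): ramps all reducers back up, issuing fresh requests.
              requestsSeenByRm += scheduledReducers;
            }
            // Prints roughly 66 million, matching the observation above.
            System.out.println("Requested containers accumulated at RM: " + requestsSeenByRm);
          }
        }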

      This bug also triggered YARN-4844, which makes the RM stop scheduling anything.

      Thanks to Sidharta Seethana for helping with analysis.

        Issue Links

          Activity

          leftnoteasy Wangda Tan added a comment -

          One quick solution for this issue is to modify preemptReducesIfNeeded to return whether preemption happened. If preemption happened, skip the subsequent scheduleReduces.
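
          A minimal sketch of that idea, assuming (hypothetically) that preemptReducesIfNeeded is changed to return a boolean and that heartbeat calls it directly; the real method signatures and call sites in RMContainerAllocator may differ:

            // Hedged sketch only -- actual RMContainerAllocator methods and call sites
            // may differ. The point is that the ramp-up step is skipped in any heartbeat
            // that just ramped reducers down, so the two steps stop undoing each other
            // every second.
            class QuickFixSketch {
              // Changed (hypothetically) to report whether any reducers were ramped down.
              private boolean preemptReducesIfNeeded() {
                boolean preempted = false;
                // ... existing preemption logic; set preempted = true when scheduled
                // reducer requests are cancelled or running reducers are preempted ...
                return preempted;
              }

              private void scheduleReduces() {
                // ... existing reducer ramp-up logic ...
              }

              void heartbeat() {
                if (!preemptReducesIfNeeded()) {
                  scheduleReduces();
                }
              }
            }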

          CC: Karthik Kambatla, Jason Lowe.

          leftnoteasy Wangda Tan added a comment -

          Also, the following logic does not look correct to me:

            private void clearAllPendingReduceRequests() {
              LOG.info("Ramping down all scheduled reduces:"
                  + scheduledRequests.reduces.size());
              for (ContainerRequest req : scheduledRequests.reduces.values()) {
                pendingReduces.add(req);
              }
              scheduledRequests.reduces.clear();
            }
          

          Instead of only calling scheduledRequests.reduces.clear(), it should call decContainerReq for each request in scheduledRequests.reduces. The existing logic never modifies the remote request table.
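
          A sketch of the correction being described, assuming decContainerReq(ContainerRequest) behaves as in the allocator; the actual fix landed via MAPREDUCE-6514 and may differ in detail:

            private void clearAllPendingReduceRequests() {
              LOG.info("Ramping down all scheduled reduces:"
                  + scheduledRequests.reduces.size());
              for (ContainerRequest req : scheduledRequests.reduces.values()) {
                decContainerReq(req);     // decrement the ask that was sent to the RM
                pendingReduces.add(req);  // keep the reduce so it can be rescheduled later
              }
              scheduledRequests.reduces.clear();
            }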

          haibochen Haibo Chen added a comment -

          We saw an instance of a similar problem previously as well, but we were not able to investigate it in great detail because the logs were overwritten quickly. Do you still have the MR AM logs?

          MAPREDUCE-6514 has already been created to fix the issue of not updating the requests in clearAllPendingReduceRequests().

          leftnoteasy Wangda Tan added a comment -

          Thanks Haibo Chen for pointing out MAPREDUCE-6514.

          MAPREDUCE-6514 is one cause of this problem, but it only became a big problem after MAPREDUCE-6302 was committed.

          I discussed this offline with Varun Saxena; I will rebase and upload a patch to MAPREDUCE-6514 later. For this JIRA, I will fix the cancel-all-then-add-all reducer request behavior.

          The application log is available at https://www.dropbox.com/s/ckx1z993lt4ymh2/app.log.zip?dl=0 (it is too large to be uploaded to JIRA).

          leftnoteasy Wangda Tan added a comment -

          Uploaded patch for this (on top of MAPREDUCE-6514)

          varun_saxena Varun Saxena added a comment -

          Just to give some background, MAPREDUCE-6514 came up while analyzing the issue in MAPREDUCE-6513.
          During our analysis, we found two areas susceptible to problems.
          One was decrementing and updating container requests when we clear pending reduce requests, which has been handled in MAPREDUCE-6514.
          The other was whether we should ramp up reducers at all if maps have been hanging for a certain period of time.

          We wanted feedback on these potential issues from community members who have worked extensively on MapReduce.
          Refer to the comments below (made on MAPREDUCE-6513):
          comment1, comment2

          Basically, this point got lost along the way as we went ahead with rescheduling map requests with priority 5 in MAPREDUCE-6513.

          But I thought I would put it out there again.
          In RMContainerAllocator#scheduleReduces we ramp up reduces. The calculations in this method are such that, with default configurations, this is done too aggressively.
          The configuration yarn.app.mapreduce.am.job.reduce.rampup.limit has a default value of 0.5. If headroom is limited (as in the MAPREDUCE-6513 scenario), i.e. barely enough to launch one mapper/reducer, then because of this config value the AM thinks there is sufficient room to launch one more mapper, so there is no need to ramp down and reducers are ramped up. However, if this continues forever, that does not seem correct.
          Should we really be ramping up if we have hanging map requests, irrespective of the reduce ramp-up limit configuration?
          We could probably use the configuration introduced in MAPREDUCE-6302 to determine whether maps are hanging (i.e. stuck in the scheduled state) and not ramp up reduces if maps have been hanging for a while. How long to wait, however, would depend on the kind of job being run.

          We also have MAPREDUCE-6541, which adjusts the headroom received from the RM to determine whether we have enough resources for a map task to run.
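
          For reference, a job can tune the ramp-up knob itself; below is a minimal, hypothetical example. The 0.25 value is arbitrary and not a recommendation from this thread; the property name and its 0.5 default are the ones quoted above.

            // Illustration only: lowering the reducer ramp-up limit for a single job so
            // reducers are ramped up less aggressively.
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.mapreduce.Job;

            public class RampupLimitExample {
              public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Default is 0.5; lower values leave more headroom for map tasks.
                conf.setFloat("yarn.app.mapreduce.am.job.reduce.rampup.limit", 0.25f);
                Job job = Job.getInstance(conf, "example-job");
                // ... set mapper/reducer classes, input/output paths, etc., as usual ...
              }
            }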

          jlowe Jason Lowe added a comment -

          Sorry for arriving late to the discussion.

          Should we really be ramping up if we have hanging map requests, irrespective of the reduce ramp-up limit configuration?

          Ramping up reducers when maps are hanging does sound a bit dubious, but it may make sense in some scenarios. Consider a case where a job issues tons of maps, far more than the queue can handle. Some of those maps are going to appear to be hanging for a very long time because they have to run in multiple waves. The whole point of ramping up reducers before the maps are complete is to try to reduce job latency (at the expense of overall cluster throughput) by pipelining the shuffle of the completed tasks with the remaining map tasks. If the job has tons of data to shuffle for each map then it may make sense to sacrifice some of the map resources to get the reducers running early so they can start chewing on the horde of completed map output. It all depends upon the map durations, the shuffle burden, etc. It is definitely safer from a correctness point of view to avoid ramping up reducers if there are any hanging maps at all, but I believe there could be some jobs whose latency could increase as a result of that change.

          I'm guessing the root cause of the issue is an incorrect headroom report, e.g.: there's technically enough free space in the headroom but it's fragmented across nodes in such a way that no single map can fit on any node. The unconditional preemption logic from MAPREDUCE-6302 was supposed to address this, but it looks like the container allocator can quickly "forget" this decision and re-schedule the reducers that were shot.

          jlowe Jason Lowe added a comment -

          Patch looks good to me, submitting to Jenkins now that MAPREDUCE-6514 has been committed.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 13s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 24s trunk passed
          +1 compile 0m 22s trunk passed with JDK v1.8.0_91
          +1 compile 0m 25s trunk passed with JDK v1.7.0_95
          +1 checkstyle 0m 17s trunk passed
          +1 mvnsite 0m 27s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 0m 42s trunk passed
          +1 javadoc 0m 15s trunk passed with JDK v1.8.0_91
          +1 javadoc 0m 18s trunk passed with JDK v1.7.0_95
          +1 mvninstall 0m 23s the patch passed
          +1 compile 0m 19s the patch passed with JDK v1.8.0_91
          +1 javac 0m 19s the patch passed
          +1 compile 0m 22s the patch passed with JDK v1.7.0_95
          +1 javac 0m 22s the patch passed
          -1 checkstyle 0m 14s hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app: patch generated 1 new + 111 unchanged - 0 fixed = 112 total (was 111)
          +1 mvnsite 0m 26s the patch passed
          +1 mvneclipse 0m 11s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 0m 58s the patch passed
          +1 javadoc 0m 13s the patch passed with JDK v1.8.0_91
          +1 javadoc 0m 15s the patch passed with JDK v1.7.0_95
          +1 unit 8m 32s hadoop-mapreduce-client-app in the patch passed with JDK v1.8.0_91.
          +1 unit 9m 16s hadoop-mapreduce-client-app in the patch passed with JDK v1.7.0_95.
          +1 asflicense 0m 20s Patch does not generate ASF License warnings.
          33m 4s



          Subsystem Report/Notes
          Docker Image: yetus/hadoop:cf2ee45
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12802512/MAPREDUCE-6689.1.patch
          JIRA Issue MAPREDUCE-6689
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 009df3d66296 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 2835f14
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6484/artifact/patchprocess/diff-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6484/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6484/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          +1 lgtm. Will commit this later today if there are no objections.

          varun_saxena Varun Saxena added a comment -

          Agree.
          If we do introduce a config to decide whether maps have been starved (and hence not ramp up reducers), it will have to be tuned per job, not only based on the type of job but also on the size of data it processes in each run, and several other factors.
          I do see that it would be almost impossible to decide an accurate value for such a config.

          We have had the fix from MAPREDUCE-6514 in our private branch for several months, but do not yet have MAPREDUCE-6302 in.
          Let us see how the recent fixes along with MAPREDUCE-6302 go on a real cluster. I think they should cover most of the scenarios.

          leftnoteasy Wangda Tan added a comment -

          Yeah, this is mainly caused by inaccurate headroom calculation. IMO, the headroom reported by the scheduler is not trustworthy: it misses many constraints, for example blacklists, hard locality, etc., and it would be expensive to calculate everything inside the scheduler. So MAPREDUCE-6302 looks like a good solution to me. However, we shouldn't "forget" that decision and immediately re-schedule the reducers that were just shot.

          Thanks for reviews, Jason Lowe/Varun Saxena.

          jlowe Jason Lowe added a comment -

          Thanks to Wangda Tan for the contribution and to Varun Saxena for additional review! I committed this to trunk, branch-2, branch-2.8, branch-2.7, and branch-2.6.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #9731 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9731/)
          MAPREDUCE-6689. MapReduce job can infinitely increase number of reducer (jlowe: rev c9bb96fa81fc925e33ccc0b02c98cc2d929df120)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/rm/TestRMContainerAllocator.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Closing the JIRA as part of 2.7.3 release.


            People

            • Assignee: leftnoteasy Wangda Tan
            • Reporter: leftnoteasy Wangda Tan
            • Votes: 0
            • Watchers: 13
