Hadoop Map/Reduce / MAPREDUCE-5817

mappers get rescheduled on node transition even after all reducers are completed

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: applicationmaster
    • Labels:
    • Target Version/s:

      Description

      We're seeing a behavior where a job runs long after all reducers were already finished. We found that the job was rescheduling and running a number of mappers beyond the point of reducer completion. In one situation, the job ran for some 9 more hours after all reducers completed!

      This happens because whenever a node transition (to an unusable state) comes into the app master, it just reschedules all mappers that already ran on the node in all cases.

      Therefore, any node transition has the potential to extend the job's running time. Once this window opens, another node transition can prolong it, and in theory this can happen indefinitely.

      If there is some instability in the pool (nodes becoming unhealthy, etc.) for a period of time, then any big job is severely vulnerable to this problem.

      If all reducers have completed, JobImpl.actOnUnusableNode() should not reschedule mapper tasks. At that point the mapper outputs are no longer needed, so any rescheduled mapper would produce output that is never consumed.
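
      The proposed guard can be sketched as follows. This is a minimal, self-contained model and not the actual JobImpl code: the class RescheduleGuardSketch, its fields, and the method mapsToReschedule are hypothetical names used only for illustration. It also assumes that a job with zero reducers keeps the existing behavior, which the description above does not address.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the decision taken when a node becomes unusable.
// This is NOT the real JobImpl.actOnUnusableNode(); all names are illustrative.
class RescheduleGuardSketch {
    final int numReduceTasks;
    final int completedReduceTasks;

    RescheduleGuardSketch(int numReduceTasks, int completedReduceTasks) {
        this.numReduceTasks = numReduceTasks;
        this.completedReduceTasks = completedReduceTasks;
    }

    /** True when the job has reducers and every one of them has completed. */
    boolean allReducersComplete() {
        // Assumption: map-only jobs (zero reducers) are out of scope here
        // and fall through to the existing rescheduling behavior.
        return numReduceTasks > 0 && completedReduceTasks == numReduceTasks;
    }

    /** Returns the succeeded map attempts on the lost node that should be re-run. */
    List<String> mapsToReschedule(List<String> succeededMapsOnNode) {
        if (allReducersComplete()) {
            // No reducer is left to fetch map output, so re-running is pointless.
            return Collections.emptyList();
        }
        return new ArrayList<>(succeededMapsOnNode);
    }

    public static void main(String[] args) {
        List<String> onNode = Arrays.asList("map_0", "map_1");
        System.out.println(new RescheduleGuardSketch(2, 1).mapsToReschedule(onNode));
        System.out.println(new RescheduleGuardSketch(2, 2).mapsToReschedule(onNode));
    }
}
```

      With this guard in place, a node transition that arrives after the last reducer finishes no longer reopens the window in which the job keeps running.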

          Activity

          Sangjin Lee added a comment -

          We're talking about two options for this: (1) modify JobImpl.actOnUnusableNode() so that it does not reschedule mappers if all reducers are completed, and (2) modify checkReadyForCommit() so that it transitions to COMMITTING if all reducers are completed (if reducers exist) instead of checking that all tasks are completed.

          Either approach seems to have some downsides.

          For (1), the change is pretty narrow (only affects the rescheduling scenario). However, it still lets the mapper tasks that were rescheduled prior to reducer completion run. So the job may linger until those mapper tasks run to completion. And if those mapper tasks fail for any reason, it may render the job as failed (even though all reducers may have succeeded in reality).

          For (2), it would be effective and would make the job finish much more quickly. On the other hand, we'd need to do something about the mapper tasks that are running at that point. They may need to be killed. Also, if the original mapper tasks were successful, we may need to "resurrect" their status from KILLED to SUCCESSFUL to avoid confusion.

          Sangjin Lee added a comment -

          I am leaning towards option (1) for its simplicity and smaller impact in general. It still leaves rescheduled mappers running when all reducers complete, but I think it would be a much smaller risk than the problem we're facing.

          I'll code up a patch and submit it soon. Let me know if you have suggestions/comments.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12638107/mapreduce-5817.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app:

          org.apache.hadoop.mapreduce.v2.app.TestMRAppMaster

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4476//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4476//console

          This message is automatically generated.

          Sangjin Lee added a comment -

          The test failures are unrelated to this patch. They are coming from MAPREDUCE-5815.

          Gera Shegalov added a comment -

          Thanks for working on this, Sangjin Lee. I would like to advocate for Option 2 with "resurrect", where the job moves to COMMITTING once all the output is in HDFS:
          a) the job succeeds faster
          b) there is no ambiguity about which mappers' output was actually consumed.

          Sangjin Lee added a comment -

          Gera Shegalov, I agree with the pros of option (2). On the other hand, I do feel uneasy about "resurrecting" a killed attempt, which is necessary with option (2). I don't think that's done today, so it would be somewhat unprecedented. Also, what do you think the scope of changes would be?

          Gera Shegalov added a comment -

          The scope is:

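          We should redefine JobImpl.checkReadyForCommit to return COMMITTING as follows:

          if (numReduceTasks > 0) {
            // Job has reducers: commit as soon as every reducer has succeeded.
            if (succeededReduceTaskCount == numReduceTasks) {
              return COMMITTING;
            }
          } else if (completedTaskCount == tasks.size()) {
            // Map-only job: commit once every task has completed.
            return COMMITTING;
          }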
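
          To make the effect of this condition concrete, here is a self-contained sketch that can be run on its own. It is only a model under stated assumptions: the class CommitReadinessSketch and the State enum are hypothetical stand-ins, the counters are passed in as plain parameters, and returning RUNNING when neither condition holds is an assumption, since the fragment above does not say what happens otherwise.

```java
// Self-contained model of the proposed readiness check (illustrative names only).
class CommitReadinessSketch {
    // Hypothetical stand-in for the job's internal states.
    enum State { RUNNING, COMMITTING }

    static State checkReadyForCommit(int numReduceTasks,
                                     int succeededReduceTaskCount,
                                     int completedTaskCount,
                                     int totalTasks) {
        if (numReduceTasks > 0) {
            // With reducers, only reducer success matters for committing.
            if (succeededReduceTaskCount == numReduceTasks) {
                return State.COMMITTING;
            }
        } else if (completedTaskCount == totalTasks) {
            // Map-only job: every task must have completed.
            return State.COMMITTING;
        }
        // Assumption: otherwise the job simply keeps running.
        return State.RUNNING;
    }

    public static void main(String[] args) {
        // 10 maps + 2 reducers; both reducers succeeded, one map was rescheduled
        // and is still outstanding, so only 11 of 12 tasks are complete.
        System.out.println(checkReadyForCommit(2, 2, 11, 12));
        // Map-only job with 4 of 5 tasks complete.
        System.out.println(checkReadyForCommit(0, 0, 4, 5));
    }
}
```

          The first call in main models the situation from the description: under this check, a rescheduled mapper no longer holds up the commit once both reducers have succeeded.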
          

          To address the unprecedented nature of this, we can introduce a new state for TaskAttemptImpl, LOST_COMPLETE, that only mappers can enter from SUCCEEDED when their node is lost.

          On successful Job commit, we should generate an event that will kill all outstanding unfinished map task attempts. Move one map task attempt from LOST to SUCCEEDED (e.g., the one with the minimum id) if there is none yet.
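
          The post-commit cleanup described here could look roughly like the sketch below for the attempts of a single map task. Everything in it is hypothetical: ResurrectSketch, Attempt, AttemptState, and finalizeMapTask are illustrative names and not existing Hadoop classes or methods, and the sketch assumes that the "LOST" state mentioned above is the proposed LOST_COMPLETE state.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of the proposed cleanup after a successful job commit.
// None of these types exist in Hadoop; they only illustrate the idea.
class ResurrectSketch {
    enum AttemptState { RUNNING, SUCCEEDED, LOST_COMPLETE, KILLED }

    static final class Attempt {
        final int id;
        AttemptState state;

        Attempt(int id, AttemptState state) {
            this.id = id;
            this.state = state;
        }
    }

    /** Applies the cleanup to all attempts of one map task, in place. */
    static void finalizeMapTask(List<Attempt> attempts) {
        // Step 1: kill every attempt that has not finished yet.
        for (Attempt a : attempts) {
            if (a.state == AttemptState.RUNNING) {
                a.state = AttemptState.KILLED;
            }
        }
        // Step 2: if no attempt is SUCCEEDED, resurrect the LOST_COMPLETE
        // attempt with the smallest id so the task shows one success.
        boolean hasSucceeded = attempts.stream()
                .anyMatch(a -> a.state == AttemptState.SUCCEEDED);
        if (!hasSucceeded) {
            attempts.stream()
                    .filter(a -> a.state == AttemptState.LOST_COMPLETE)
                    .min(Comparator.comparingInt((Attempt a) -> a.id))
                    .ifPresent(a -> a.state = AttemptState.SUCCEEDED);
        }
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        attempts.add(new Attempt(2, AttemptState.RUNNING));
        attempts.add(new Attempt(1, AttemptState.LOST_COMPLETE));
        attempts.add(new Attempt(0, AttemptState.LOST_COMPLETE));
        finalizeMapTask(attempts);
        for (Attempt a : attempts) {
            System.out.println(a.id + " -> " + a.state);
        }
    }
}
```

          Choosing the minimum id is just the tie-break suggested above; any deterministic rule would serve the same purpose of leaving exactly one attempt marked as successful.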

          Hadoop QA added a comment -

          -1 overall

          Vote | Subsystem | Runtime | Comment
          -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun.

          Subsystem | Report/Notes
          Patch URL | http://issues.apache.org/jira/secure/attachment/12638107/mapreduce-5817.patch
          Optional Tests | javadoc javac unit findbugs checkstyle
          git revision | trunk / f1a152c
          Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5557/console

          This message was automatically generated.

          Hadoop QA added a comment -

          -1 overall

          Vote | Subsystem | Runtime | Comment
          -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun.

          Subsystem | Report/Notes
          Patch URL | http://issues.apache.org/jira/secure/attachment/12638107/mapreduce-5817.patch
          Optional Tests | javadoc javac unit findbugs checkstyle
          git revision | trunk / f1a152c
          Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5566/console

          This message was automatically generated.


            People

            • Assignee: Sangjin Lee
            • Reporter: Sangjin Lee
            • Votes: 0
            • Watchers: 11
