Hadoop Map/Reduce
MAPREDUCE-3921

MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 2.0.2-alpha
    • Component/s: mr-am, mrv2
    • Labels: None
    • Hadoop Flags: Reviewed
    • Attachments:
    1. MAPREDUCE-3921.patch
      47 kB
      Bikas Saha
    2. MAPREDUCE-3921-1.patch
      50 kB
      Bikas Saha
    3. MAPREDUCE-3921-10.patch
      64 kB
      Bikas Saha
    4. MAPREDUCE-3921-11.patch
      64 kB
      Bikas Saha
    5. MAPREDUCE-3921-3.patch
      51 kB
      Bikas Saha
    6. MAPREDUCE-3921-4.patch
      51 kB
      Bikas Saha
    7. MAPREDUCE-3921-5.patch
      55 kB
      Bikas Saha
    8. MAPREDUCE-3921-6.patch
      55 kB
      Bikas Saha
    9. MAPREDUCE-3921-7.patch
      59 kB
      Bikas Saha
    10. MAPREDUCE-3921-9.patch
      59 kB
      Bikas Saha
    11. MAPREDUCE-3921-branch-0.23.patch
      44 kB
      Bikas Saha
    12. MAPREDUCE-3921-branch-0.23.patch
      44 kB
      Bikas Saha
    13. MAPREDUCE-3921-branch-0.23.patch
      44 kB
      Bikas Saha


        Activity

        Vinod Kumar Vavilapalli added a comment -

        MR AM needs RM to support this information. Linking the tickets.

        Bikas Saha added a comment -

        Attaching a patch that builds on MAPREDUCE-3353:
        1) RMContainerAllocator receives node updates along with allocated containers
        2) It sends a KILL event to map task attempts running on unusable nodes
        3) It sends a JobUpdatedNode event to JobImpl
        4) JobImpl maintains a mapping of nodes to the successful task attempts that have run on them
        5) On receiving updated nodes, JobImpl sends KILL events to the map task attempts from 4)
        6) Successful task completions retroactively transition to the KILLED state if their successful task attempt is the one killed in 5). They reschedule another attempt.
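
        A minimal sketch of the bookkeeping in steps 4) and 5), using plain String ids as stand-ins for the real NodeId/TaskAttemptId types (NodeTracker is an illustrative name, not a class in the patch):

          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          // Illustrative stand-in for the node-to-attempt map kept by JobImpl in step 4).
          class NodeTracker {
            private final Map<String, List<String>> succeededAttemptsByNode = new HashMap<>();

            // 4) record each successful task attempt against the node it ran on
            void recordSuccess(String nodeId, String attemptId) {
              succeededAttemptsByNode
                  .computeIfAbsent(nodeId, k -> new ArrayList<>())
                  .add(attemptId);
            }

            // 5) when a node becomes unusable, these attempts get a KILL event,
            // which lets the task reschedule a fresh attempt (step 6)
            List<String> attemptsToKill(String unusableNodeId) {
              return succeededAttemptsByNode.getOrDefault(unusableNodeId, Collections.emptyList());
            }
          }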

        Bikas Saha added a comment -

        Patch with above implemented.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12519982/MAPREDUCE-3921-branch-0.23.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 12 new or modified tests.

        -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

        -1 javac. The applied patch generated 510 javac compiler warnings (more than the trunk's current 507 warnings).

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2102//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2102//console

        This message is automatically generated.

        Bikas Saha added a comment -

        The javadoc warning and one javac warning were caused by a spurious import of sun libraries that Eclipse had inserted.
        The remaining two javac warnings are similar to existing warnings:
        ======
        [WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java:[626,25] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
        [WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java:[647,29] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
        ======
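
        For context, a self-contained illustration of where such [unchecked] warnings come from; the EventHandler interface below is a stand-in for org.apache.hadoop.yarn.event.EventHandler, not the real class:

          // Stand-in for org.apache.hadoop.yarn.event.EventHandler<T>.
          interface EventHandler<T> {
            void handle(T event);
          }

          class RawTypeWarningDemo {
            public static void main(String[] args) {
              EventHandler<String> typed = e -> System.out.println(e);
              typed.handle("checked call, no warning");

              @SuppressWarnings("rawtypes")
              EventHandler raw = typed; // raw type, as in RMContainerAllocator
              raw.handle("javac flags this call as [unchecked]");
            }
          }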

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12519999/MAPREDUCE-3921-branch-0.23.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 12 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 509 javac compiler warnings (more than the trunk's current 507 warnings).

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2105//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2105//console

        This message is automatically generated.

        Bikas Saha added a comment -

        Some cleanup, plus a new diff created against trunk since the previous one was for 0.23.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12521616/MAPREDUCE-3921.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified test files.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 509 javac compiler warnings (more than the trunk's current 507 warnings).

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
        org.apache.hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry
        org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
        org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2165//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2165//console

        This message is automatically generated.

        Robert Joseph Evans added a comment -

        A few minor comments about the patch, and some questions on the manual testing that was done on it. Overall the patch looks very good; once the javac warnings are addressed and I know manual testing was performed, I am +1 on it.

        1. Have you tested this with AM Recovery? Specifically I would like to see the AM recover when a map task finished successfully and then was killed because the node went bad.
        2. Have you tested this with reduces? The code will reschedule the map task, but I don't really see where/if it informs the reducer that it is rescheduling the map task until that new task finishes successfully. I believe that the reducer would just ignore an update for a task it has already fetched successfully, but I just want to be sure it was tested.
        3. NodeState.isUnhealthy() (Very minor) I think it would be cleaner to have it be the following (a compilable sketch appears after this list):
          return this == UNHEALTHY ||
                 this == DECOMMISSIONED ||
                 this == LOST;
          
        4. KilledAfterSuccessTransition.transition() There is some commented-out code:
          // why set a wrong finish time ???
          //set the finish time
          //taskAttempt.setFinishTime();
          

          Is this needed? If not, please remove it.

        5. KilledAfterSuccessTransition.transition() I am a bit confused by the log statement:
          if (taskAttempt.getLaunchTime() != 0) {
            ...
          }else {
            LOG.debug("Not generating HistoryFinish event since start event not generated for taskAttempt: "
                + taskAttempt.getID());
          }
          

          Is this really needed (it looks like it was copied and pasted from the KilledTransition)? When would we even get a successful task attempt that did not have a launch time? I would rather it be an ERROR or WARN than a DEBUG if we do see this in this transition.

        6. TaskAttemptCompletedEventTransition.transition()
          // TODO assert nodeId is not null
          

          Please either add the assert or remove the TODO.
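
        A compilable sketch of the helper suggested in 3., assuming a trimmed-down stand-in for the real YARN NodeState enum (the actual enum has more states):

          // Trimmed stand-in for org.apache.hadoop.yarn.api.records.NodeState.
          enum NodeState {
            NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, LOST;

            // true for any state in which the node can no longer run containers
            public boolean isUnhealthy() {
              return this == UNHEALTHY ||
                     this == DECOMMISSIONED ||
                     this == LOST;
            }
          }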

        Bikas Saha added a comment -

        1) By AM Recovery do you mean recovery after restart? The test added in this patch checks that the AM restarts a previously successful task when the node (on which it ran) goes bad. See TestMRApp.java.
        2) My understanding was that the new version of the map is a pre-emptively created copy. The running reduces would use their existing inputs. Is that not the case? Are reducers informed about new locations for map outputs on the fly?
        4), 6) The comments are there for reviewers, to clarify those points. E.g., some of the code was taken from similar actions elsewhere; they set the finish time and I was not sure that was the correct thing to do. I don't think the assert is necessary given the current code, but do you usually put in asserts?
        5) The log means that this task was not started and hence further history events are not being added. This is similar to other places in the code.

        Robert Joseph Evans added a comment -

        For the testing I just want to be sure that nothing catastrophically bad happens in these cases. If a failed task is not detected until the reducer fails to fetch data from it, that is fine with me; but if the AM dies or hangs, or if there is somehow data corruption, I really would like to avoid that.

        By AM Recovery I mean that when the AM dies, i.e. it was on a bad node, the RM will restart it. The AM then looks through the JobHistory logs to find out which tasks finished successfully before it died, and which ones need to be restarted. I just want to be sure that if a map task is restarted because a node is unhealthy and the AM also is restarted that the recovery code will handle that case correctly.

        Are reducers informed about new locations for map outputs on the fly?

        That is my understanding otherwise no reducer could be launched until all mappers had finished, and all reducers would have to be relaunched if a map task disappeared on a bad node.

        I dont think the assert is necessary given the current code but do you usually put in asserts?

        I don't usually put in asserts. But I don't really like dangling TODOs lying around. If it is something that needs to be done, I feel we should either do it or file a JIRA to track it so it gets done. If it is not something that needs to be done, then we don't need a TODO for it. If this is a copy-and-paste TODO I am OK with leaving it; that is the reason I did not comment on the other TODOs added to the code, since I could see where they were copied from.

        The log means that this task was not started and hence further history events are not being added. This is similar to other places in the code

        Yes, I can see the place where it was copied from. What I am referring to is that the KilledTransition, where this looks like it came from, handles the kill event coming in from many different states. In some of those states it is reasonable to have a launch time of 0. In KilledAfterSuccessTransition, as the name implies, it seems very difficult to have a task attempt in the "SUCCESS" state that had no launch time. A task that finished successfully but was never run seems odd to me; if you want to leave it for defensive programming I am happy to, but I would prefer the log message not be debug, so someone looking can see that something odd happened here.

        The comments are for reviewers to clarify those points. eg. Some of the code was taken from similar actions elsewhere. They set the finish time and I was not sure if that was the correct thing to do.

        It seems logical that if you are killing a task you want to be sure the finish time is set, so just set it; but it should already have been set for the SUCCESS case, so I would just leave it off. I really don't know for sure, though.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12521796/MAPREDUCE-3921-1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified test files.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 509 javac compiler warnings (more than the trunk's current 507 warnings).

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2171//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2171//console

        This message is automatically generated.

        Bikas Saha added a comment -

        Attached new patch.
        1) Cleaned up asserts, logs and minor comments. The javac warnings are the same as pre-existing warnings around the use of raw types for events.
        2) Replaced the newly added TaskEventType.T_ATTEMPT_KILLED_AFTER_SUCCESS with the existing TaskEventType.T_ATTEMPT_KILLED. The successful attempt was being killed, and it makes sense to reuse the existing code flow. There was some reason (which is lost in my notes) for which I had added a new event type, but after looking at the code I don't see any reason to do so now.
        3) All map task completion events (succeeded, killed, etc.) are synced with the reducers. When a map task is killed because of a bad node, that event will be sent to the reducer; then, when the new attempt completes, the reducer will know about it, just like any other change in map outputs. All of this is pre-existing functionality, based on my understanding of the code and talking offline with Vinod. So your concern about informing the reducers about the newly killed map task is already addressed by the pre-existing code flow.
        4) AM recovery. I was having trouble manually creating failures on a real cluster, so I went ahead and enhanced the newly added TestMRApp.testUpdatedNodes() to cover AM recovery. The test now checks for successful tasks being killed and rerun on node failure; then the AM is restarted and the test verifies that those completed tasks are recovered. While that worked and this patch passed the tests, a variant of the test exposed a different problem.

        In recovery mode, the recovery service assigns a success status to any task that has a FINISHED event reported. The only way that status could be changed is if there is a FAILED event for that task, in which case a failed status is assigned to that task. So once a task is marked with a success status, it remains so even when subsequent events kill the successful task attempt and mark it invalid.
        Next the recovery service adds all success status tasks into a completedTasks collection. Then it proceeds to enumerate the events and process them. When it hits a TaskEventType.*_KILLED/FAILED/SUCCEEDED then it removes those attempts from the completedTasks. Recovery does not complete until all attempts of all completedTasks are removed. Now the following sequence of events can happen for Tasks A and B. A1 represents task attempt 1 of A.
        CompletedTasks contains A and B. A1 and A2 are succeeded. A2 was a rerun of A1. B1 is succeeded and B2 was running when AM crashed.
        A1- container request is processed. It uses the nodeid info from A1 to work.
        B1- container request is processed. It uses the nodeid info from B1 to work.
        A1- Succeeded removes A1
        B1- Succeeded removes B1
        A2- container request is processed. It uses the nodeid info from A2 to work
        B2- container request is processed. It uses the nodeid info from B2 to work. But there is no such info as it is populated on task completion. AM crashed here while trying to resolve the nodeid.
        If AM had not crashed the following would have happened
        A2- Succeeded removes A2
        There is no FAILED/KILLED/SUCCEEDED event for B2 since it was running when the AM crashed. So it seems the AM would never move out of recovery.

        If the above is correct, there seem to be two problems:
        1) While recovery is in progress, event handling happens for task attempts that are not in a completed state. I am not sure whether the recovery design allows this and the current crash is simply a case of missing info.
        2) Expecting every task attempt of a completedTask to have a KILLED/FAILED/SUCCEEDED entry. This seems to be clearly wrong in the current scenario.
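
        To make the stuck-recovery scenario concrete, here is a toy model of the bookkeeping described above (hypothetical names and structures; the real recovery service is more involved):

          import java.util.Arrays;
          import java.util.HashMap;
          import java.util.HashSet;
          import java.util.List;
          import java.util.Map;
          import java.util.Set;

          public class RecoveryStuckDemo {
            public static void main(String[] args) {
              // completedTasks: task -> attempts for which recovery expects a
              // terminal (KILLED/FAILED/SUCCEEDED) event
              Map<String, Set<String>> pending = new HashMap<>();
              pending.put("A", new HashSet<>(Arrays.asList("A1", "A2")));
              pending.put("B", new HashSet<>(Arrays.asList("B1", "B2"))); // B2 was still running

              // terminal events present in the job history; B2 has none
              // because the AM crashed while B2 was running
              List<String[]> historyEvents = Arrays.asList(
                  new String[] {"A", "A1"},
                  new String[] {"B", "B1"},
                  new String[] {"A", "A2"});

              for (String[] e : historyEvents) {
                pending.get(e[0]).remove(e[1]);
              }
              pending.values().removeIf(Set::isEmpty);

              // prints {B=[B2]}: recovery never completes while B2 is outstanding
              System.out.println("still pending after replay: " + pending);
            }
          }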

        Robert Joseph Evans added a comment -

        Kicking Jenkins.

        Robert Joseph Evans added a comment -

        I did a quick look at the code and it looks good to me. As for the recovery error you discovered, could you please file a follow-up JIRA for it, as it is a preexisting issue that can be triggered by AM recovery with speculative execution. This patch may expose the issue more frequently, but not enough to really worry me that much: you need two nodes going down very close to one another, which is possible, but not that frequent.

        Robert Joseph Evans added a comment -

        Someone pointed out to me that my comment is a bit confusing. When I said two nodes going down very close to one another, I meant that for this to happen we would need more than one node to go down in succession, with the right processes running on them. But now that I think about it more, I am not even sure it will expose the issue.

        Robert Joseph Evans added a comment -

        I have been looking at the patch and I think it looks good, but it is rather large, and there are some unanswered questions in the code that I cannot answer, so I would feel more comfortable if Vinod or Sid gave it a quick once-over before I check it in.

        Also, it looks like some of the imports in TestMRApp.java have changed, and a quick upmerge would be good so it applies cleanly.

        Bikas Saha added a comment -

        Thanks! Robert, could you look at MAPREDUCE-4128 please? That one is pretty small and would make this one less risky.

        I have a new patch for this based on the patch for MAPREDUCE-4128; it cleans up this patch against the latest changes to trunk.

        Bikas Saha added a comment -

        Attaching a patch after pulling the latest changes, with an improved test for AM recovery.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12522596/MAPREDUCE-3921-3.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified test files.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 508 javac compiler warnings (more than the trunk's current 506 warnings).

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell
        org.apache.hadoop.yarn.server.TestDiskFailures
        org.apache.hadoop.yarn.server.TestContainerManagerSecurity
        org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
        org.apache.hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry
        org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
        org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs
        org.apache.hadoop.mapred.TestMiniMRClasspath
        org.apache.hadoop.mapreduce.v2.TestMRJobs
        org.apache.hadoop.mapred.TestMiniMRWithDFSWithDistinctUsers
        org.apache.hadoop.mapred.TestMiniMRBringup
        org.apache.hadoop.mapred.TestMiniMRChildTask
        org.apache.hadoop.mapred.TestReduceFetch
        org.apache.hadoop.mapred.TestClusterMRNotification
        org.apache.hadoop.mapred.TestReduceFetchFromPartialMem
        org.apache.hadoop.mapred.TestJobCounters
        org.apache.hadoop.mapreduce.TestChild
        org.apache.hadoop.mapred.TestMiniMRClientCluster
        org.apache.hadoop.ipc.TestSocketFactory
        org.apache.hadoop.mapreduce.v2.TestMRJobsWithHistoryService
        org.apache.hadoop.mapreduce.v2.TestMROldApiJobs
        org.apache.hadoop.mapreduce.v2.TestSpeculativeExecution
        org.apache.hadoop.mapreduce.lib.output.TestJobOutputCommitter
        org.apache.hadoop.mapred.TestClientRedirect
        org.apache.hadoop.mapred.TestLazyOutput
        org.apache.hadoop.mapred.TestJobCleanup
        org.apache.hadoop.mapreduce.TestMapReduceLazyOutput
        org.apache.hadoop.mapred.TestSpecialCharactersInOutputPath
        org.apache.hadoop.mapreduce.v2.TestMRAppWithCombiner
        org.apache.hadoop.conf.TestNoDefaultsJobConf
        org.apache.hadoop.mapreduce.v2.TestRMNMInfo
        org.apache.hadoop.mapred.TestClusterMapReduceTestCase
        org.apache.hadoop.mapreduce.v2.TestNonExistentJob
        org.apache.hadoop.mapred.TestJobSysDirWithDFS
        org.apache.hadoop.mapreduce.v2.TestUberAM
        org.apache.hadoop.mapreduce.v2.TestMiniMRProxyUser
        org.apache.hadoop.mapred.TestJobName
        org.apache.hadoop.mapreduce.security.TestJHSSecurity

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2222//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2222//console

        This message is automatically generated.

        Bikas Saha added a comment -

        New patch with latest synced changes from trunk.

        Siddharth Seth added a comment -

        A couple of questions and suggestions.

        • Does 'node unhealthy' need to be treated differently from 'TooManyFetchFailures'? Killed versus Failed. This ends up with NodeFailures not counting towards the limit on task attempts.
        • JOB_UPDATED_NODES needs to be handled in the JOB_INIT state. Very small chance of hitting this.
        • Minor: JobImpl.actOnUsableNode can get the task type from the id itself. It doesn't need to fetch the actual task.
        • Minor: "if this attempt is not successful" - this comment in JobImpl can be removed. It's removing an entry from a successfulAttempt index.
        • In KilledAfterSuccessTransition - createJobCounterUpdateEventTAFailed should be createJobCounterUpdateEventTAKilled.
        • TaskImpl.handleAttemptCompletion - finishedAttempts - this will end up double counting the same task attempt. It's used in some other transition.
        • Does the JobHistoryParser need some more changes - to unset fields which may have been set previously by Map/ReduceAttemptSuccessfulEvents and TaskFinishedEvent?
        • For running tasks - shouldn't running Reduce attempts also be killed?
        • RMContainerAllocator.handleUpdatedNodes - instead of fetching the nodeId via appContext, job, etc., the nodeId can be stored with the AssignedRequest (see the sketch after this list). 1) getTask and getAttempt require readLocks - those calls every second can be avoided. 2) There's an unlikely race where the nodeId may not be assigned in the TaskAttempt (if the dispatcher thread is backlogged).
        • TaskAttemptId.getNodeId() can be avoided. getContainerManagerAddress can be used instead.

        Not related to this patch.
        Does JOB_TASK_COMPLETED need to be handled (ignored) in additional states?
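
        A rough sketch of the suggestion in the handleUpdatedNodes bullet above, with illustrative names (this AssignedRequest is a stand-in, not the allocator's actual data structure):

          import java.util.ArrayList;
          import java.util.Collection;
          import java.util.List;
          import java.util.Set;

          // Hypothetical record of an assignment that keeps the node id (a String
          // here) next to the attempt, so no Job/Task lookup or readLock is needed.
          class AssignedRequest {
            final String attemptId;
            final String containerId;
            final String nodeId; // captured once, at assignment time

            AssignedRequest(String attemptId, String containerId, String nodeId) {
              this.attemptId = attemptId;
              this.containerId = containerId;
              this.nodeId = nodeId;
            }
          }

          class UpdatedNodesHandler {
            // on a node-update heartbeat, scan the assignments directly
            List<String> attemptsOnUnusableNodes(Collection<AssignedRequest> assigned,
                Set<String> unusableNodes) {
              List<String> toKill = new ArrayList<>();
              for (AssignedRequest r : assigned) {
                if (unusableNodes.contains(r.nodeId)) {
                  toKill.add(r.attemptId); // these attempts get a KILL event
                }
              }
              return toKill;
            }
          }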

        Bikas Saha added a comment -

        Does 'node unhealthy' need to be treated differently from 'TooManyFetchFailures' ? Killed versus Failed. This ends up with NodeFailures not counting towards the limit on task attempts.

        Yes, I think. Based on our experience, here we are pre-emptively taking action on a task that might actually be ok. And it should be an infrequent action.

        JOB_UPDATED_NODES needs to be handled in the JOB_INIT state. Very small chance of hitting this.

        My understanding was that scheduling happens when the job moves from INIT to RUNNING state via the StartTransition(). Unless allocate is called on RM it will not return any unhealthy machines. So I thought that JOB_UPDATED_EVENT can never come until the job moves into the RUNNING state. Can you please point out the scenario you are thinking about?
        I can make the change for safety reasons, just in case.

        Minor: JobImpl.actOnUsableNode can get the task type from the id itself. It doesn't need to fetch the actual task.

        Unless you really want this, I would prefer it the way its currently written. I prefer not to depend on string name encodings.

        Minor: "if this attempt is not successful" this comment in JobImpl can be removed. It's removing an entry from a successfulAttempt index.

        That was a question I had, so I put it in the comments. It seems that for a TaskAttemptCompletedEventTransition the code removes the previous successful entry from successAttemptCompletionEventNoMap. It then checks whether the current attempt is successful, and in that case adds it to successAttemptCompletionEventNoMap. But what if the current attempt is not successful? We have then removed the previous successful attempt too. Is that the desired behavior? This question is independent of this JIRA.

        In KilledAfterSuccessTransition - createJobCounterUpdateEventTAFailed should be createJobCounterUpdateEventTAKilled

        Done.

        TaskImpl.handleAttemptCompletion - finishedAttempts - this will end up double counting the same task attempt. It's used in some other transition.

        I have moved the finishedAttempts increment out of that function and made it explicit in every transition that requires it.
        In the same context, I have a question (in the comments) in MapRetroactiveFailureTransition: why is it not calling handleAttemptCompletion? My understanding is that handleAttemptCompletion is used to notify reducers about changes in map outputs. So if a map failed after success, the reducers should know about it so that they can abandon its outputs before getting too many fetch failures. Is that not so?

        Does the JobHistoryParser need some more changes - to unset fields which may have been set previously by Map/ReduceAttemptSuccessfulEvents and TaskFinishedEvent

        Done. Reset all fields set in handleTaskFinishedEvent. Others are already handled in the existing code.

        For running tasks - shouldn't running Reduce attempts also be killed ?

        My understanding of existing behavior in mrv1 was that only maps are pre-emptively terminated for performance reasons.

        RMContainerAllocator.handleUpdatedNodes - instead of fetching the nodeId via appContext, job etc - the nodeId can be stored with the AssignedRequest. 1) getTask, getAttempt require readLocks - can avoid these calls every second. 2) There's an unlikely race where the nodeId may not be assigned in the TaskAttempt (if the dispatcher thread is backlogged). TaskAttemptId.getNodeId() can be avoided. getContainerManagerAddress can be used instead.

        Sorry, I did not find getContainerManagerAddress(). The map in AssignedRequests stores ContainerId, and it's not possible to get the nodeId from it. What are you proposing?

        Not related to this patch. Does JOB_TASK_COMPLETED need to be handled (ignored) in additional states?

        It does not look like it, but there may be race conditions I have not thought of. Looking further, it seems that the action on this event checks for job completion in TaskCompletedTransition. TaskCompletedTransition increments job.completedTaskCount irrespective of whether the task succeeded, was killed, or failed. Now, TaskCompletedTransition.checkJobCompleteSuccess() checks job.completedTaskCount == job.tasks.size() for completion. How is this working? Won't enough killed/failed tasks plus completed tasks trigger job completion? Or is that expected behavior?
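
        A toy illustration of the concern, with hypothetical counters rather than the real JobImpl fields: if killed or failed task completions also increment completedTaskCount, the equality check can be satisfied without every task having succeeded.

          public class CompletionCheckDemo {
            public static void main(String[] args) {
              int totalTasks = 4;
              int completedTaskCount = 0;

              completedTaskCount += 3; // three tasks finish successfully
              completedTaskCount += 1; // one task ends up killed

              // modeled on the checkJobCompleteSuccess() test: fires even though
              // one task never produced a successful attempt
              boolean jobLooksComplete = (completedTaskCount == totalTasks);
              System.out.println("job looks complete: " + jobLooksComplete); // true
            }
          }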

        Bikas Saha added a comment -

        Added a new patch addressing the comments above.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12524035/MAPREDUCE-3921-5.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified test files.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 494 javac compiler warnings (more than the trunk's current 492 warnings).

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.yarn.server.TestContainerManagerSecurity
        org.apache.hadoop.yarn.server.resourcemanager.security.TestApplicationTokens
        org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
        org.apache.hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry
        org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
        org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs
        org.apache.hadoop.mapred.TestClientRedirect
        org.apache.hadoop.mapreduce.security.TestJHSSecurity

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2302//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2302//console

        This message is automatically generated.

        Siddharth Seth added a comment -

        Yes, I think. Based on our experience, here we are pre-emptively taking action on a task that might actually be ok. And it should be an infrequent action.

        My understanding of existing behavior in mrv1 was that only maps are pre-emptively terminated for performance reasons.

        I think 'fetch failure' / 'node unhealthy' should be considered in the same way - at least for the purpose of counting towards the allowed_task_failure limit. Ideally for the task's state as well. There's currently no way to distinguish between a task causing a node to go unhealthy versus other problems. My guess is 'fetch failures' are more often than not caused by a bad tracker rather than a bad task.
        WRT killing reduce tasks on unhealthy nodes - I'm not sure what was done in 20 (from a quick look, I couldn't find the code which kills map tasks either). It'd be best if Vinod or others with more knowledge and history about how and why 20 deals with this pitch in.

        My understanding was that scheduling happens when the job moves from INIT to RUNNING state via the StartTransition(). Unless allocate is called on RM it will not return any unhealthy machines. So I thought that JOB_UPDATED_EVENT can never come until the job moves into the RUNNING state. Can you please point out the scenario you are thinking about?

        Calls to allocate() start once the RMCommunicator service is started - which happens before a JOB_START event is sent. Very unlikely - but there's an extremely remote possibility of an allocate call completing before a job moves into the START state.

        Unless you really want this, I would prefer it the way it's currently written. I prefer not to depend on string name encodings.

        It's safe to use TaskId.getTaskType() - don't need to explicitly depend on string name encoding. Avoids the extra task lookups.
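
        (Illustration only, not from the patch: a self-contained Java toy of the point above, with stand-in types instead of the real org.apache.hadoop.mapreduce.v2.api.records classes.)

        public class TaskTypeLookupSketch {
          enum TaskType { MAP, REDUCE }

          record TaskId(int id, TaskType taskType) {}
          record TaskAttemptId(TaskId taskId, int attempt) {}

          // The type is recoverable from the id itself - no Task lookup
          // (and no read lock) is needed to decide whether to act on it.
          static boolean isMapAttempt(TaskAttemptId attemptId) {
            return attemptId.taskId().taskType() == TaskType.MAP;
          }

          public static void main(String[] args) {
            TaskAttemptId a = new TaskAttemptId(new TaskId(3, TaskType.MAP), 0);
            System.out.println(isMapAttempt(a)); // true
          }
        }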

        That was a question I had, and I put it in the comments. It seems that for a TaskAttemptCompletedEventTransition the code removes the previous successful entry from successAttemptCompletionEventNoMap. It then checks if the current attempt is successful, and in that case adds it to the successAttemptCompletionEventNoMap. But what if the current attempt is not successful? We have now removed the previous successful attempt too. Is that the desired behavior? This question is independent of this jira.

        It also marks the removed entry as OBSOLETE - so the taskAttemptCompletionEvents list doesn't have any SUCCESSFUL attempts for the specific taskId.

        I have moved the finishedTask increment out of that function and made it explicit in every transition that requires it to be that way.

        In the same context I have a question in comments in MapRetroactiveFailureTransition. Why is this not calling handleAttemptCompletion? My understanding is that handleAttemptCompletion is used to notify reducers about changes in map outputs. So if a map was failed after success then reducers should know about it so that they can abandon its outputs before getting too many fetch failures. Is that not so?
        It is calling it via AttemptFailedTransition.transition(). That's the bit which also counts the failure towards the allowed_failure_limit.

        Sorry, I did not find getContainerManagerAddress(). The map in AssignedRequests stores ContainerId, and it's not possible to get the nodeId from it. What are you proposing?

        Correction - it's called getAssignedContainerMgrAddress. IAC, I was proposing storing the container's NodeId with the AssignedRequest - that completely removes the need to fetch the actual task.

        It does not look like it, but there may be race conditions I have not thought of. But looking further, it seems that the action on this event checks for job completion in TaskCompletedTransition. TaskCompletedTransition increments job.completedTaskCount irrespective of whether the task succeeded, was killed, or failed. Now, TaskCompletedTransition.checkJobCompleteSuccess() checks job.completedTaskCount == job.tasks.size() for completion. How is this working? Won't enough killed/failed + completed tasks trigger job completion? Or is that expected behavior?

        It checks for failure before attempting the SUCCESS check - so that should work. Unless I'm missing something, tasks could complete after a job moves to the FAILED state - which would end up generating this event.
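
        (A self-contained toy paraphrase of the order of checks described above - simplified names, not the real JobImpl code. It also makes the question above concrete: a killed task still increments completedTaskCount, so killed + succeeded == total reports success.)

        public class CompletionCheckSketch {
          enum TaskState { SUCCEEDED, FAILED, KILLED }

          int completedTaskCount = 0;
          int failedTaskCount = 0;
          final int totalTasks;
          final int maxAllowedFailures;

          CompletionCheckSketch(int totalTasks, int maxAllowedFailures) {
            this.totalTasks = totalTasks;
            this.maxAllowedFailures = maxAllowedFailures;
          }

          String onTaskCompleted(TaskState state) {
            completedTaskCount++;                // bumped for SUCCEEDED, FAILED and KILLED alike
            if (state == TaskState.FAILED) {
              failedTaskCount++;
              if (failedTaskCount > maxAllowedFailures) {
                return "FAILED";                 // the failure threshold is evaluated first ...
              }
            }
            if (completedTaskCount == totalTasks) {
              return "SUCCEEDED";                // ... before the completed == total check
            }
            return "RUNNING";
          }

          public static void main(String[] args) {
            CompletionCheckSketch job = new CompletionCheckSketch(2, 0);
            System.out.println(job.onTaskCompleted(TaskState.KILLED));    // RUNNING
            System.out.println(job.onTaskCompleted(TaskState.SUCCEEDED)); // SUCCEEDED
          }
        }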

        Bikas Saha added a comment -

        I see what you are saying about fetch failures and bad nodes. I am open to both approaches. The way it's currently done is based on discussions I had with Vinod a while ago.

        Changed JobImpl to ignore JOB_UPDATED_NODES events in the NEW and INITED states.
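
        (Schematically - a self-contained toy, not the real StateMachineFactory table in JobImpl: ignoring an event amounts to registering a same-state transition with no hook.)

        import java.util.EnumSet;

        public class IgnoreEventSketch {
          enum JobState { NEW, INITED, RUNNING }
          enum JobEventType { JOB_UPDATED_NODES, JOB_START }

          // A same-state transition with no hook makes the event a no-op;
          // before RUNNING there is no node-to-attempt bookkeeping yet.
          static JobState handle(JobState current, JobEventType event) {
            if (event == JobEventType.JOB_UPDATED_NODES
                && EnumSet.of(JobState.NEW, JobState.INITED).contains(current)) {
              return current; // ignored
            }
            // ... all other transitions elided
            return current;
          }

          public static void main(String[] args) {
            System.out.println(handle(JobState.NEW, JobEventType.JOB_UPDATED_NODES)); // NEW
          }
        }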

        Changed to TaskId.getTaskType()

        Yes, the removed entry is marked OBSOLETE. But I still don't understand why that would be done if the current entry is not successful. Why lose the previously successful entry when the current one is not successful itself? This should be done only when the current entry is successful.

        I like the change to store NodeIds instead of ContainerIds in the AssignedRequests map. I would like to make it a separate change and not merge it with this one. There might be other gotchas to doing that.

        Bikas Saha added a comment -

        Review comments fixed.

        Siddharth Seth added a comment -

        Yes, the removed entry is marked OBSOLETE. But I still don't understand why that would be done if the current entry is not successful. Why lose the previously successful entry when the current one is not successful itself? This should be done only when the current entry is successful.

        This list is sent over to reduce tasks - which do consider the OBSOLETE state when deciding on which map outputs need to be fetched.
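
        (Toy illustration of why that matters on the reduce side - simplified and self-contained, not the real shuffle code: an OBSOLETE entry tells the reducer to abandon that map output instead of fetching it.)

        import java.util.*;

        public class ObsoleteFilterSketch {
          enum Status { SUCCEEDED, OBSOLETE, FAILED }
          record Event(String mapAttemptId, Status status) {}

          // Reducers keep their fetch candidates in sync with the event list:
          static Set<String> outputsToFetch(List<Event> events) {
            Set<String> pending = new LinkedHashSet<>();
            for (Event e : events) {
              if (e.status() == Status.SUCCEEDED)     pending.add(e.mapAttemptId());
              else if (e.status() == Status.OBSOLETE) pending.remove(e.mapAttemptId()); // abandon it
            }
            return pending;
          }

          public static void main(String[] args) {
            List<Event> events = List.of(
                new Event("map_0_attempt_0", Status.SUCCEEDED),
                new Event("map_0_attempt_0", Status.OBSOLETE),   // node went bad, entry obsoleted
                new Event("map_0_attempt_1", Status.SUCCEEDED)); // re-run succeeded elsewhere
            System.out.println(outputsToFetch(events));          // [map_0_attempt_1]
          }
        }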

        I like the change to store NodeIds instead of ContainerIds in the AssignedRequests map. I would like to make it a separate change and not merge it with this one. There might be other gotchas to doing that.

        Both NodeId and ContainerId can be stored (I believe containerId is required). That should be a reasonably simple change - and will allow TA_KILLs to be sent directly.

        Bikas Saha added a comment -

        I will change the map to store the Container directly, and return the ContainerId and NodeId from getters on the Container object.
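
        (A minimal sketch of that shape, with simplified stand-in types rather than the real AssignedRequests class: keeping the whole Container in the map makes both ids available without a task lookup.)

        import java.util.*;

        public class AssignedRequestsSketch {
          record Container(String containerId, String nodeId) {}

          // attempt -> Container: both ids travel together, so no task lookup
          // is needed when matching running attempts against unusable nodes.
          final Map<String, Container> assigned = new HashMap<>();

          void assign(String taskAttemptId, Container c) { assigned.put(taskAttemptId, c); }

          String nodeOf(String taskAttemptId)      { return assigned.get(taskAttemptId).nodeId(); }
          String containerOf(String taskAttemptId) { return assigned.get(taskAttemptId).containerId(); }

          public static void main(String[] args) {
            AssignedRequestsSketch s = new AssignedRequestsSketch();
            s.assign("attempt_0", new Container("container_01", "host1:45454"));
            System.out.println(s.nodeOf("attempt_0")); // host1:45454
          }
        }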

        About the OBSOLETE part: I get how it is used. What I don't get is why we are marking a previously successful task as obsolete and invalid upon the completion of a new attempt without first checking whether the new attempt was itself successful or not.

        Bikas Saha added a comment -

        Changed assignedRequests maps to use Container as the value instead of ContainerId

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12524088/MAPREDUCE-3921-7.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified test files.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 494 javac compiler warnings (more than the trunk's current 492 warnings).

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.yarn.server.TestContainerManagerSecurity
        org.apache.hadoop.yarn.server.resourcemanager.security.TestApplicationTokens
        org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
        org.apache.hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry
        org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
        org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs
        org.apache.hadoop.mapred.TestClientRedirect
        org.apache.hadoop.mapreduce.TestYarnClientProtocolProvider
        org.apache.hadoop.mapreduce.security.TestJHSSecurity

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2305//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2305//console

        This message is automatically generated.

        Siddharth Seth added a comment -

        Thanks for the updated patch, Bikas. Will take a look. Still waiting for input from the MR veterans on some of the previous comments - how things were handled in 20, specifically for killing map/reduce tasks on unhealthy nodes, and treating 'node unhealthy' similarly to 'fetch failure' (state KILLED / FAILED, as well as counting towards max_attempts).

        About the OBSOLETE part: I get how it is used. What I don't get is why we are marking a previously successful task as obsolete and invalid upon the completion of a new attempt without first checking whether the new attempt was itself successful or not.

        Are you considering leaving the task in SUCCESSFUL state, even if it's being retried, so that the Reduce may be able to pull data before there's a new SUCCESSFUL attempt?
        Otherwise, marking the attempt as OBSOLETE and removing the task from successAttemptCompletionEventNoMap (which tracks only SUCCESSFUL attempts) seems like the correct thing to do.

        Bikas Saha added a comment -

        Here is how I read TaskAttemptCompletedEventTransition in JobImpl:
        Task T has a new attempt completion event, T2.
        At this point T1 is a successful attempt that has been recorded in successAttemptCompletionEventNoMap. So it is still a valid successful event.
        T is removed from successAttemptCompletionEventNoMap and its T1 entry in taskAttemptCompletionEvents is marked obsolete.
        Now T2's status is checked and, if successful, T is re-added to successAttemptCompletionEventNoMap.

        This means that while retry attempt T2 was running, T was considered successful because it was in successAttemptCompletionEventNoMap.
        1) So if we do not want to leave a task marked successful while it is being retried, then T should already have been removed from successAttemptCompletionEventNoMap and T1 marked obsolete. Removing T after completion of T2 is not correct.
        2) However, if we want T to remain successful until we have another successful attempt, then it should be removed from successAttemptCompletionEventNoMap only when T2 is successful. But currently we remove T from successAttemptCompletionEventNoMap regardless of T2's status.
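
        (A self-contained paraphrase of that sequence, with simplified stand-ins for JobImpl's fields - not the real classes:)

        import java.util.*;

        public class CompletionEventSketch {
          enum Status { SUCCEEDED, FAILED, KILLED, OBSOLETE }

          static class CompletionEvent {
            final String attemptId;
            Status status;
            CompletionEvent(String attemptId, Status status) { this.attemptId = attemptId; this.status = status; }
          }

          final List<CompletionEvent> taskAttemptCompletionEvents = new ArrayList<>();
          final Map<String, Integer> successAttemptCompletionEventNoMap = new HashMap<>();

          void onAttemptCompleted(String taskId, String attemptId, Status status) {
            // 1) The previous successful attempt (T1) is dropped unconditionally ...
            Integer prev = successAttemptCompletionEventNoMap.remove(taskId);
            if (prev != null) {
              taskAttemptCompletionEvents.get(prev).status = Status.OBSOLETE;
            }
            // 2) ... the new completion event (T2) is recorded ...
            taskAttemptCompletionEvents.add(new CompletionEvent(attemptId, status));
            // 3) ... and T2 is indexed as successful only if it actually succeeded.
            if (status == Status.SUCCEEDED) {
              successAttemptCompletionEventNoMap.put(taskId, taskAttemptCompletionEvents.size() - 1);
            }
          }

          public static void main(String[] args) {
            CompletionEventSketch job = new CompletionEventSketch();
            job.onAttemptCompleted("t", "t_a0", Status.SUCCEEDED); // T1 succeeds
            job.onAttemptCompleted("t", "t_a1", Status.FAILED);    // T2 fails: T1 is gone anyway
            System.out.println(job.successAttemptCompletionEventNoMap.containsKey("t")); // false
          }
        }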

        Bikas Saha added a comment -

        Guys,
        Would it be OK to commit this and, if required, create a new jira based on the historical behavior info you need?

        Arun C Murthy added a comment -

        Sorry to come in late.
        Some clarifications:

        1. MR1 JT kills all running tasks on a TT when it's deemed 'lost'.
        2. It also kills all completed maps on that TT for 'active' jobs.
        3. The tasks are marked KILLED rather than FAILED and thus don't count towards the job, which is correct since it wasn't the job's fault.

        Hope this helps.

        Bikas Saha added a comment -

        Adding a patch in which the AM sends kill events to map and reduce tasks, to keep behavior similar to the JT in mrv1.
        One thing to note is that the RM also terminates such containers. However, in the AM such RM terminations mark the task attempt as FAILED, but in this case we need to mark it as KILLED. So it is necessary to send the kill event in the AM so that it pre-empts the transition to FAILED in the normal handling of such cases.
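
        (A simplified, self-contained sketch of the AM-side reaction described in this thread - not the patch itself, and the names here are stand-ins: on each heartbeat, updated node reports are scanned and a kill is issued for attempts assigned to nodes that became unusable, so the attempt reaches KILLED before the RM-driven container termination can mark it FAILED.)

        import java.util.*;

        public class HandleUpdatedNodesSketch {
          enum NodeState { RUNNING, UNHEALTHY, LOST }

          interface KillHandler { void killAttempt(String taskAttemptId, String reason); }

          final Map<String, String> attemptToNode = new HashMap<>(); // running attempt -> nodeId
          final KillHandler handler;

          HandleUpdatedNodesSketch(KillHandler handler) { this.handler = handler; }

          void handleUpdatedNodes(Map<String, NodeState> nodeReports) {
            Set<String> unusable = new HashSet<>();
            for (Map.Entry<String, NodeState> e : nodeReports.entrySet()) {
              if (e.getValue() != NodeState.RUNNING) {
                unusable.add(e.getKey());
              }
            }
            for (Map.Entry<String, String> e : attemptToNode.entrySet()) {
              if (unusable.contains(e.getValue())) {
                // the KILL event pre-empts the normal FAILED transition
                handler.killAttempt(e.getKey(), "node " + e.getValue() + " became unusable");
              }
            }
          }

          public static void main(String[] args) {
            HandleUpdatedNodesSketch s =
                new HandleUpdatedNodesSketch((id, why) -> System.out.println("TA_KILL " + id + ": " + why));
            s.attemptToNode.put("attempt_5", "host1:45454");
            s.handleUpdatedNodes(Map.of("host1:45454", NodeState.UNHEALTHY));
          }
        }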

        Siddharth Seth added a comment -

        Submitting to jenkins.

        Minor stuff
        In TaskAttemptImpl, createJobCounterUpdateEventTAKilled - SLOTS_MILLIS_MAPS shouldn't be updated if a task attempt is transitioning from SUCCEEDED to FAILED (see the sketch after this list).
        Some minor formatting changes required in RMContainerAllocator (spacing in the for loop). Also the warnings in the same class.
        Otherwise, lgtm.
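
        (Sketch of the guard being asked for - hypothetical names, not the real TaskAttemptImpl code: the slot-millis update is skipped for an attempt that had already finished successfully, so a retroactive transition does not count the same slot time twice.)

        public class CounterGuardSketch {
          long slotMillisMaps = 0;

          // Skip the SLOTS_MILLIS update when the attempt had already finished
          // successfully (the success path already accounted for it); a
          // retroactive SUCCEEDED -> KILLED/FAILED transition must not add it again.
          void onAttemptEnded(long slotMillisUsed, boolean alreadySucceeded) {
            if (!alreadySucceeded) {
              slotMillisMaps += slotMillisUsed;
            }
          }

          public static void main(String[] args) {
            CounterGuardSketch c = new CounterGuardSketch();
            c.onAttemptEnded(5000, true);         // retroactive kill of a SUCCEEDED attempt
            System.out.println(c.slotMillisMaps); // 0 - nothing double counted
          }
        }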

        Bobby, the patch doesn't apply to the 23 branch. It has a dependency on MAPREDUCE-3958. Do you want to pull that in to 23 as well?

        Bikas Saha added a comment -

        Fixed the counters. Also fixed the FAILED transition, which had the same issue.
        Suppressed the unchecked warning.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12531492/MAPREDUCE-3921-10.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2448//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2448//console

        This message is automatically generated.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12531700/MAPREDUCE-3921-11.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2450//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2450//console

        This message is automatically generated.

        Siddharth Seth added a comment -

        +1 lgtm. Thanks Bikas.

        Siddharth Seth added a comment -

        Committed to trunk and branch-2.

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #2364 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2364/)
        MAPREDUCE-3921. MR AM should act on node health status changes. Contributed by Bikas Saha. (Revision 1349065)

        Result = SUCCESS
        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349065
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobEventType.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobUpdatedNodesEvent.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/TaskAttemptKillEvent.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeState.java
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #2342 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2342/)
        MAPREDUCE-3921. MR AM should act on node health status changes. Contributed by Bikas Saha. (Revision 1349065)

        Result = SUCCESS
        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349065
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobEventType.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobUpdatedNodesEvent.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/TaskAttemptKillEvent.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeState.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #2415 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2415/)
        MAPREDUCE-3921. MR AM should act on node health status changes. Contributed by Bikas Saha. (Revision 1349065)

        Result = SUCCESS
        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349065
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobEventType.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobUpdatedNodesEvent.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/TaskAttemptKillEvent.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeState.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #1074 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1074/)
        MAPREDUCE-3921. MR AM should act on node health status changes. Contributed by Bikas Saha. (Revision 1349065)

        Result = SUCCESS
        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349065
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobEventType.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobUpdatedNodesEvent.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/TaskAttemptKillEvent.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeState.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #1107 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1107/)
        MAPREDUCE-3921. MR AM should act on node health status changes. Contributed by Bikas Saha. (Revision 1349065)

        Result = FAILURE
        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349065
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobEventType.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobUpdatedNodesEvent.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/TaskAttemptKillEvent.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeState.java
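        For readers skimming the file list above: the two new event classes (JobUpdatedNodesEvent, TaskAttemptKillEvent) carry the core of the change. The AM is told which nodes have become unusable and kills the attempts that ran on them so they can be rescheduled. The sketch below is a hypothetical, self-contained Java illustration of that idea only; the type and method names are stand-ins invented for this example, not the actual Hadoop APIs touched by this commit.

        // Hypothetical sketch, not Hadoop source: illustrates an AM-style
        // reaction to node health updates by killing attempts that ran on
        // nodes that have become unusable.
        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        public class NodeUpdateSketch {

            // Simplified stand-in for the states tracked in NodeState.java.
            enum NodeState { RUNNING, UNHEALTHY, LOST }

            // Stand-in for a JobUpdatedNodesEvent-style payload: one node whose
            // state changed in the latest ResourceManager heartbeat.
            record NodeUpdate(String nodeId, NodeState state) { }

            // Stand-in for per-job bookkeeping: node -> attempts that ran there.
            private final Map<String, List<String>> attemptsByNode = new HashMap<>();

            void recordAttempt(String nodeId, String attemptId) {
                attemptsByNode.computeIfAbsent(nodeId, k -> new ArrayList<>()).add(attemptId);
            }

            // When a node turns unusable, return the attempts to kill so the
            // framework can reschedule them (stand-in for TaskAttemptKillEvent).
            List<String> onNodeUpdate(NodeUpdate update) {
                if (update.state() == NodeState.RUNNING) {
                    return List.of(); // node is healthy; nothing to do
                }
                List<String> toKill = attemptsByNode.getOrDefault(update.nodeId(), List.of());
                for (String attempt : toKill) {
                    System.out.println("KILL " + attempt + " (node " + update.nodeId()
                        + " is " + update.state() + ")");
                }
                return toKill;
            }

            public static void main(String[] args) {
                NodeUpdateSketch am = new NodeUpdateSketch();
                am.recordAttempt("host1:8042", "attempt_1_m_000003_0");
                am.onNodeUpdate(new NodeUpdate("host1:8042", NodeState.UNHEALTHY));
            }
        }

        In the actual patch, the corresponding logic is spread across the RMContainerAllocator, JobImpl, TaskImpl, and TaskAttemptImpl files listed above rather than living in a single class.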
        Kihwal Lee added a comment -

        This patch introduced an issue, MAPREDUCE-4376, which was exposed by TestClusterMRNotification.


          People

          • Assignee: Bikas Saha
          • Reporter: Vinod Kumar Vavilapalli
          • Votes: 0
          • Watchers: 9
