Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-2997

NM keeps sending already-sent completed containers to RM until containers are removed from context

    Details

    • Hadoop Flags:
      Reviewed

      Description

      We have seen in RM log a lot of

      INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...

      It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence getRMContainer returns null.

      1. YARN-2997.2.patch
        9 kB
        Chengbing Liu
      2. YARN-2997.3.patch
        13 kB
        Chengbing Liu
      3. YARN-2997.4.patch
        12 kB
        Chengbing Liu
      4. YARN-2997.5.patch
        12 kB
        Chengbing Liu
      5. YARN-2997.patch
        4 kB
        Chengbing Liu

        Activity

        Hide
        chengbing.liu Chengbing Liu added a comment -

        Report to RM only once by not calling containerStatuses.add(containerStatus); from the second time on.

        Tested on a real cluster and it works well.

        Show
        chengbing.liu Chengbing Liu added a comment - Report to RM only once by not calling containerStatuses.add(containerStatus); from the second time on. Tested on a real cluster and it works well.
        Hide
        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12689447/YARN-2997.patch
        against trunk revision 249cc90.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

        org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6215//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6215//console

        This message is automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12689447/YARN-2997.patch against trunk revision 249cc90. +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6215//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6215//console This message is automatically generated.
        Hide
        kasha Karthik Kambatla added a comment -

        With work-preserving restart, the NM is required to intimate the RM repeatedly in case the RM goes down and loses this information. I propose we ignore the latter updates, or add code to identify them duplicates and then ignore.

        Show
        kasha Karthik Kambatla added a comment - With work-preserving restart, the NM is required to intimate the RM repeatedly in case the RM goes down and loses this information. I propose we ignore the latter updates, or add code to identify them duplicates and then ignore.
        Hide
        jianhe Jian He added a comment -

        add code to identify them duplicates and then ignore.

        I think it is now ignored. perhaps we should clarify the logging and move it to debug level.

        Show
        jianhe Jian He added a comment - add code to identify them duplicates and then ignore. I think it is now ignored. perhaps we should clarify the logging and move it to debug level.
        Hide
        chengbing.liu Chengbing Liu added a comment -

        Karthik Kambatla Jian He Thanks for your review. I think NM will call getNMContainerStatuses() instead of getContainerStatuses() upon receiving RESYNC from restarted RM. getNMContainerStatuses() remains unchanged and still reports all the completed containers for non-completed apps. The uploaded patch will let the normal heartbeat (not after receiving RESYNC) send only useful container status information to RM. So I guess the work-preserving RM restart is not affected.

        Show
        chengbing.liu Chengbing Liu added a comment - Karthik Kambatla Jian He Thanks for your review. I think NM will call getNMContainerStatuses() instead of getContainerStatuses() upon receiving RESYNC from restarted RM. getNMContainerStatuses() remains unchanged and still reports all the completed containers for non-completed apps. The uploaded patch will let the normal heartbeat (not after receiving RESYNC) send only useful container status information to RM. So I guess the work-preserving RM restart is not affected.
        Hide
        jianhe Jian He added a comment -

        The uploaded patch will let the normal heartbeat

        The intention was to let NM remove containers from its context only after RM acks it has received these containers. More context in YARN-1372

        Show
        jianhe Jian He added a comment - The uploaded patch will let the normal heartbeat The intention was to let NM remove containers from its context only after RM acks it has received these containers. More context in YARN-1372
        Hide
        chengbing.liu Chengbing Liu added a comment -

        Jian He The containers are not removed in this patch, they are just not reported to RM when the following conditions are met:

        • The application is not finished
        • The container was completed and was already in recentlyStoppedContainers
        • It is a normal heartbeat with RM, not after RM restart

        Note that the container is not removed from the NM context. In a resync with RM, these completed applications will still be reported for work-preserving recovery.

        Show
        chengbing.liu Chengbing Liu added a comment - Jian He The containers are not removed in this patch, they are just not reported to RM when the following conditions are met: The application is not finished The container was completed and was already in recentlyStoppedContainers It is a normal heartbeat with RM, not after RM restart Note that the container is not removed from the NM context. In a resync with RM, these completed applications will still be reported for work-preserving recovery.
        Hide
        chengbing.liu Chengbing Liu added a comment -

        Updated patch.

        It handles the following issues:

        • If a container is completed, and the corresponding application is still running, the NM will send duplicated reports about the container, which is unnecesary.
        • Currently, if a heartbeat with RM and NM is lost, while the NM is sending a completed container status whose application is in finished state, it will not send again. In the updated patch, the NM will store all the completed container statuses and resend them after a lost heartbeat.
        • Some test cases are is fixed based on the above considerations.

        Please help review the patch, thanks!

        Show
        chengbing.liu Chengbing Liu added a comment - Updated patch. It handles the following issues: If a container is completed, and the corresponding application is still running, the NM will send duplicated reports about the container, which is unnecesary. Currently, if a heartbeat with RM and NM is lost, while the NM is sending a completed container status whose application is in finished state, it will not send again. In the updated patch, the NM will store all the completed container statuses and resend them after a lost heartbeat. Some test cases are is fixed based on the above considerations. Please help review the patch, thanks!
        Hide
        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12689672/YARN-2997.2.patch
        against trunk revision e7257ac.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6226//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6226//console

        This message is automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12689672/YARN-2997.2.patch against trunk revision e7257ac. +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6226//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6226//console This message is automatically generated.
        Hide
        jianhe Jian He added a comment -

        Chengbing Liu, thanks for your explanation ! patch looks good overall, few comments:

        • for simplicity, we can use the addAll method for the for loop.
          for (ContainerStatus containerStatus : pendingCompletedContainers.values()) {
                containerStatuses.add(containerStatus);
              }
          
        • pendingCompletedContainers, maybe use a set instead of a map?
        • pendingCompletedContainers.remove(containerId); this line may be not needed, given pendingCompletedContainers.clear() is invoked earlier
        • I found pendingContainersToRemove potentially has a leak, we should probably add following in the while loop of removeOrTrackCompletedContainersFromContext, would you mind fixing this too ?
                if (nmContainer == null) {
                  iter.remove();
                }
          
        • could you add code comments on the modified test cases, so that people can reason more easily ? thx
                  if (heartBeatID == 2) {
                      Assert.assertEquals(statuses.size(), 4);
                    } else {
                      Assert.assertEquals(statuses.size(), 2);
                    }
          
        Show
        jianhe Jian He added a comment - Chengbing Liu , thanks for your explanation ! patch looks good overall, few comments: for simplicity, we can use the addAll method for the for loop. for (ContainerStatus containerStatus : pendingCompletedContainers.values()) { containerStatuses.add(containerStatus); } pendingCompletedContainers, maybe use a set instead of a map? pendingCompletedContainers.remove(containerId); this line may be not needed, given pendingCompletedContainers.clear() is invoked earlier I found pendingContainersToRemove potentially has a leak, we should probably add following in the while loop of removeOrTrackCompletedContainersFromContext, would you mind fixing this too ? if (nmContainer == null ) { iter.remove(); } could you add code comments on the modified test cases, so that people can reason more easily ? thx if (heartBeatID == 2) { Assert.assertEquals(statuses.size(), 4); } else { Assert.assertEquals(statuses.size(), 2); }
        Hide
        chengbing.liu Chengbing Liu added a comment -

        for simplicity, we can use the addAll method for the for loop.

        Yes, I will change this.

        pendingCompletedContainers, maybe use a set instead of a map?

        I'm not sure if ContainerStatus can be compared, because I didn't see an equals method defined in the abstract class ContainerStatus.

        pendingCompletedContainers.remove(containerId); this line may be not needed

        I added this line in method removeOrTrackCompletedContainersFromContext after I discovered the method is called independently in the test testRemovePreviousCompletedContainersFromContext. The test first calls removeOrTrackCompletedContainersFromContext then getContainerStatuses, and expects the container status to be removed from the result. So I guess we have to keep it?

        I found pendingContainersToRemove potentially has a leak,

        Yes you are right, I will fix this and add comments for modified test cases.

        Show
        chengbing.liu Chengbing Liu added a comment - for simplicity, we can use the addAll method for the for loop. Yes, I will change this. pendingCompletedContainers, maybe use a set instead of a map? I'm not sure if ContainerStatus can be compared, because I didn't see an equals method defined in the abstract class ContainerStatus . pendingCompletedContainers.remove(containerId); this line may be not needed I added this line in method removeOrTrackCompletedContainersFromContext after I discovered the method is called independently in the test testRemovePreviousCompletedContainersFromContext . The test first calls removeOrTrackCompletedContainersFromContext then getContainerStatuses , and expects the container status to be removed from the result. So I guess we have to keep it? I found pendingContainersToRemove potentially has a leak, Yes you are right, I will fix this and add comments for modified test cases.
        Hide
        jianhe Jian He added a comment -

        I think we can simplify the logic in getContainerStatuses as such:

              if (containerStatus.getState() == ContainerState.COMPLETE) {
                if (!isContainerRecentlyStopped(containerId)) {
                  addCompletedContainer(containerId);
                  containerStatuses.add(containerStatus);
                }
              } else {
                containerStatuses.add(containerStatus);
              }
        

        I didn't see an equals method defined in the abstract class

        the sub class has the equal method.

        So I guess we have to keep it

        that's limitation of the test, we should fix the tests.

        Show
        jianhe Jian He added a comment - I think we can simplify the logic in getContainerStatuses as such: if (containerStatus.getState() == ContainerState.COMPLETE) { if (!isContainerRecentlyStopped(containerId)) { addCompletedContainer(containerId); containerStatuses.add(containerStatus); } } else { containerStatuses.add(containerStatus); } I didn't see an equals method defined in the abstract class the sub class has the equal method. So I guess we have to keep it that's limitation of the test, we should fix the tests.
        Hide
        chengbing.liu Chengbing Liu added a comment -

        I think we can simplify the logic in getContainerStatuses as such:

        It seems that if we do not remove the containers whose app is already stopped, we will rely on the heartbeat response from RM to remove containers acked by AM. If something goes wrong on the AM or RM side, the NM will never remove these containers from context. So in my opinion, that could be a potential leak.

        the sub class has the equal method.

        Yes, you are right. However, I'm still not sure if it is a good idea to use Set<ContainerStatus> instead of Map<ContainerId, ContainerStatus> for the following reasons:

        • ContainerId is a unique identifier for a container, while ContainerStatus can be changed over time, even for the same container.
        • We want to ensure no duplicate container status reported to RM. ContainerStatus has not only containerId, but also container state, exit status and diagnostic message, we may run into a situation where we report two different ContainerStatus with same ID and different states or other stuffs.
        • ContainerId has equals method and annotated as public and stable, while ContainerStatus has no equals method and ContainerStatusPBImpl is annotated as private and unstable. It may not be a good idea to rely on the implementation of ContainerStatus.
        • The use Set<ContainerStatus> never appears in the current code base.

        that's limitation of the test, we should fix the tests.

        Yes, I see. I will fix them.

        Show
        chengbing.liu Chengbing Liu added a comment - I think we can simplify the logic in getContainerStatuses as such: It seems that if we do not remove the containers whose app is already stopped, we will rely on the heartbeat response from RM to remove containers acked by AM. If something goes wrong on the AM or RM side, the NM will never remove these containers from context. So in my opinion, that could be a potential leak. the sub class has the equal method. Yes, you are right. However, I'm still not sure if it is a good idea to use Set<ContainerStatus> instead of Map<ContainerId, ContainerStatus> for the following reasons: ContainerId is a unique identifier for a container, while ContainerStatus can be changed over time, even for the same container. We want to ensure no duplicate container status reported to RM. ContainerStatus has not only containerId, but also container state, exit status and diagnostic message, we may run into a situation where we report two different ContainerStatus with same ID and different states or other stuffs. ContainerId has equals method and annotated as public and stable, while ContainerStatus has no equals method and ContainerStatusPBImpl is annotated as private and unstable. It may not be a good idea to rely on the implementation of ContainerStatus . The use Set<ContainerStatus> never appears in the current code base. that's limitation of the test, we should fix the tests. Yes, I see. I will fix them.
        Hide
        chengbing.liu Chengbing Liu added a comment -

        Updated patch:

        • fix potential pendingContainersToRemove leak.
        • remove unnecessary pendingCompletedContainers.clear(); and add clearPendingCompletedContainers() for testing purpose only.
        • Add comments for modified tests.
        • Switch order of assertEquals. Expected value should come first to prevent confusions.
        Show
        chengbing.liu Chengbing Liu added a comment - Updated patch: fix potential pendingContainersToRemove leak. remove unnecessary pendingCompletedContainers.clear(); and add clearPendingCompletedContainers() for testing purpose only. Add comments for modified tests. Switch order of assertEquals . Expected value should come first to prevent confusions.
        Hide
        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12689964/YARN-2997.3.patch
        against trunk revision 947578c.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6238//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6238//console

        This message is automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12689964/YARN-2997.3.patch against trunk revision 947578c. +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6238//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6238//console This message is automatically generated.
        Hide
        jianhe Jian He added a comment -

        we may run into a situation where we report two different ContainerStatus with same ID

        I think this is not possible given that we are looping this.context.getContainers() which is based on containerId to Container map. Or we can just use a list.

        So in my opinion, that could be a potential leak.

        I see. then we should send the pendingCompletedContainers in getNMContainerStatuses method too. and pendingCompletedContainers.clear(); should be put after {{if (response.getNodeAction() == NodeAction.RESYNC) }}, or we can just put it at the last line of removeOrTrackCompletedContainersFromContext so as to avoid the newly added method.

        Show
        jianhe Jian He added a comment - we may run into a situation where we report two different ContainerStatus with same ID I think this is not possible given that we are looping this.context.getContainers() which is based on containerId to Container map. Or we can just use a list. So in my opinion, that could be a potential leak. I see. then we should send the pendingCompletedContainers in getNMContainerStatuses method too. and pendingCompletedContainers.clear() ; should be put after {{if (response.getNodeAction() == NodeAction.RESYNC) }}, or we can just put it at the last line of removeOrTrackCompletedContainersFromContext so as to avoid the newly added method.
        Hide
        chengbing.liu Chengbing Liu added a comment -

        I think this is not possible given that we are looping this.context.getContainers() which is based on containerId to Container map. Or we can just use a list.

        We are looping over context.getContainers(), plus possible remainders from the previous heartbeat (in case of a lost heartbeat). If the previously completed container has its status changed somehow, there would be two different ContainerStatus with same ID reported. That's why I use a map, and use pendingCompletedContainers.put(containerId, containerStatus) instead of containerStatuses.add(containerStatus) directly, in order to prevent such duplications

        then we should send the pendingCompletedContainers in getNMContainerStatuses method too

        We may not need to change getNMContainerStatuses, as it will send all container statuses in NM context, except the containers whose application is not in NM context. I think that will cover all elements in pendingCompletedContainers. And lost heartbeat is not a problem with getNMContainerStatuses.

        or we can just put it at the last line of removeOrTrackCompletedContainersFromContext so as to avoid the newly added method.

        That's a good idea. I will change this in the next patch. Thanks for your advice!

        Show
        chengbing.liu Chengbing Liu added a comment - I think this is not possible given that we are looping this.context.getContainers() which is based on containerId to Container map. Or we can just use a list. We are looping over context.getContainers() , plus possible remainders from the previous heartbeat (in case of a lost heartbeat). If the previously completed container has its status changed somehow, there would be two different ContainerStatus with same ID reported. That's why I use a map, and use pendingCompletedContainers.put(containerId, containerStatus) instead of containerStatuses.add(containerStatus) directly, in order to prevent such duplications then we should send the pendingCompletedContainers in getNMContainerStatuses method too We may not need to change getNMContainerStatuses , as it will send all container statuses in NM context, except the containers whose application is not in NM context. I think that will cover all elements in pendingCompletedContainers . And lost heartbeat is not a problem with getNMContainerStatuses . or we can just put it at the last line of removeOrTrackCompletedContainersFromContext so as to avoid the newly added method. That's a good idea. I will change this in the next patch. Thanks for your advice!
        Hide
        jianhe Jian He added a comment -

        except the containers whose application is not in NM context.

        I think we should send containers whose application is not in NMContext too for recovery.

        Show
        jianhe Jian He added a comment - except the containers whose application is not in NM context. I think we should send containers whose application is not in NMContext too for recovery.
        Hide
        chengbing.liu Chengbing Liu added a comment -

        Jian He Can we perhaps deal with getNMContainerStatuses issue in another JIRA? This one has not changed anything for RM restart yet. If so, the only thing left is the pendingCompletedContainers.clear() thing. What do you think?

        Show
        chengbing.liu Chengbing Liu added a comment - Jian He Can we perhaps deal with getNMContainerStatuses issue in another JIRA? This one has not changed anything for RM restart yet. If so, the only thing left is the pendingCompletedContainers.clear() thing. What do you think?
        Hide
        jianhe Jian He added a comment -

        sure, I will open a jira for it. thx

        Show
        jianhe Jian He added a comment - sure, I will open a jira for it. thx
        Hide
        chengbing.liu Chengbing Liu added a comment -

        Updated patch.

        The testing-only method is removed. pendingCompletedContainers.clear() is added at the end of removeOrTrackCompletedContainersFromContext, and also in RESYNC section to clear the cache so that these outdated container statuses will not be reported to the restarted RM.

        Show
        chengbing.liu Chengbing Liu added a comment - Updated patch. The testing-only method is removed. pendingCompletedContainers.clear() is added at the end of removeOrTrackCompletedContainersFromContext , and also in RESYNC section to clear the cache so that these outdated container statuses will not be reported to the restarted RM.
        Hide
        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12690286/YARN-2997.4.patch
        against trunk revision 4cd66f7.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        -1 eclipse:eclipse. The patch failed to build with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6252//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6252//console

        This message is automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690286/YARN-2997.4.patch against trunk revision 4cd66f7. +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. -1 eclipse:eclipse . The patch failed to build with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6252//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6252//console This message is automatically generated.
        Hide
        jianhe Jian He added a comment -

        thanks for updating.

        also in RESYNC section to clear the cache so that these outdated container statuses will not be reported to the restarted RM.

        could you explain more why this is added ? I think before this patch, these container statuses will be sent to the restarted RM. after the patch, it won't

        Show
        jianhe Jian He added a comment - thanks for updating. also in RESYNC section to clear the cache so that these outdated container statuses will not be reported to the restarted RM. could you explain more why this is added ? I think before this patch, these container statuses will be sent to the restarted RM. after the patch, it won't
        Hide
        chengbing.liu Chengbing Liu added a comment -

        Once got an RESYNC, NM calls getNMContainerStatuses, which will loop over all containers in the NM context, remove those whose app is not in the NM context, finally report to RM. The method getNMContainerStatuses remains unchanged before and after this patch. The logic of removing containers from context is also unchanged.

        From a different viewpoint, pendingCompletedContainers contains the following:

        • completed containers, whose app is stopped, and the container is removed from the NM context.
        • completed containers, whose app is NOT stopped (which implies their apps are in the NM context), and the container is NOT removed from the NM context.

        The first kind will not be reported to RM since they are not in the NM context, so they will not be looped.
        The second kind will be reported to RM since they are in the NM context, and their apps must be in the NM context.

        Finally, the changes of this patch can be summarized as follows:

        • Does not send finished container statuses repeatedly for running application
        • Send completed container statuses again in case of lost heartbeat (normal heartbeat, not RESYNC)

        I hope this will clarify your doubts.

        Show
        chengbing.liu Chengbing Liu added a comment - Once got an RESYNC, NM calls getNMContainerStatuses , which will loop over all containers in the NM context, remove those whose app is not in the NM context, finally report to RM. The method getNMContainerStatuses remains unchanged before and after this patch. The logic of removing containers from context is also unchanged. From a different viewpoint, pendingCompletedContainers contains the following: completed containers, whose app is stopped, and the container is removed from the NM context. completed containers, whose app is NOT stopped (which implies their apps are in the NM context), and the container is NOT removed from the NM context. The first kind will not be reported to RM since they are not in the NM context, so they will not be looped. The second kind will be reported to RM since they are in the NM context, and their apps must be in the NM context. Finally, the changes of this patch can be summarized as follows: Does not send finished container statuses repeatedly for running application Send completed container statuses again in case of lost heartbeat (normal heartbeat, not RESYNC) I hope this will clarify your doubts.
        Hide
        jianhe Jian He added a comment -

        ah, I think the problem that container statuses whose application are stopped may be lost on NM resync exists before. thanks for your clarification. one minor comment: LinkedHashMap<ContainerId, ContainerStatus>(), a regular HashMap should be enough instead of a linkedHashMap?

        Show
        jianhe Jian He added a comment - ah, I think the problem that container statuses whose application are stopped may be lost on NM resync exists before. thanks for your clarification. one minor comment: LinkedHashMap<ContainerId, ContainerStatus>() , a regular HashMap should be enough instead of a linkedHashMap?
        Hide
        chengbing.liu Chengbing Liu added a comment -

        Yes, a HashMap should be enough. I will upload a new one. Thanks.

        Show
        chengbing.liu Chengbing Liu added a comment - Yes, a HashMap should be enough. I will upload a new one. Thanks.
        Hide
        chengbing.liu Chengbing Liu added a comment -

        Update: use HashMap instead of LinkedHashMap.

        Show
        chengbing.liu Chengbing Liu added a comment - Update: use HashMap instead of LinkedHashMap.
        Hide
        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12690687/YARN-2997.5.patch
        against trunk revision ef237bd.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6276//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6276//console

        This message is automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690687/YARN-2997.5.patch against trunk revision ef237bd. +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6276//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6276//console This message is automatically generated.
        Hide
        jianhe Jian He added a comment -

        patch looks good to me.

        Show
        jianhe Jian He added a comment - patch looks good to me.
        Hide
        jianhe Jian He added a comment -

        committed to trunk and branch-2, thanks Chengbing Liu !

        Show
        jianhe Jian He added a comment - committed to trunk and branch-2, thanks Chengbing Liu !
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #6833 (See https://builds.apache.org/job/Hadoop-trunk-Commit/6833/)
        YARN-2997. Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #6833 (See https://builds.apache.org/job/Hadoop-trunk-Commit/6833/ ) YARN-2997 . Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java hadoop-yarn-project/CHANGES.txt hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
        Hide
        chengbing.liu Chengbing Liu added a comment -

        Thanks Jian He !

        Show
        chengbing.liu Chengbing Liu added a comment - Thanks Jian He !
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #68 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/68/)
        YARN-2997. Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
        • hadoop-yarn-project/CHANGES.txt
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #68 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/68/ ) YARN-2997 . Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java hadoop-yarn-project/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #802 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/802/)
        YARN-2997. Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
        • hadoop-yarn-project/CHANGES.txt
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Yarn-trunk #802 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/802/ ) YARN-2997 . Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java hadoop-yarn-project/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #65 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/65/)
        YARN-2997. Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
        • hadoop-yarn-project/CHANGES.txt
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #65 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/65/ ) YARN-2997 . Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java hadoop-yarn-project/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Hdfs-trunk #2000 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2000/)
        YARN-2997. Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
        • hadoop-yarn-project/CHANGES.txt
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Hdfs-trunk #2000 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2000/ ) YARN-2997 . Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java hadoop-yarn-project/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #69 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/69/)
        YARN-2997. Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
        • hadoop-yarn-project/CHANGES.txt
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #69 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/69/ ) YARN-2997 . Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java hadoop-yarn-project/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2019 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2019/)
        YARN-2997. Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2019 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2019/ ) YARN-2997 . Fixed NodeStatusUpdater to not send alreay-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java hadoop-yarn-project/CHANGES.txt hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
        Hide
        l201514 Siqi Li added a comment -

        The latest patch can be applied to 2.6.0 branch cleanly

        Show
        l201514 Siqi Li added a comment - The latest patch can be applied to 2.6.0 branch cleanly
        Hide
        vinodkv Vinod Kumar Vavilapalli added a comment -

        Pulled this into 2.6.1. Ran compilation and TestNodeStatusUpdater before the push. Patch applied cleanly.

        Show
        vinodkv Vinod Kumar Vavilapalli added a comment - Pulled this into 2.6.1. Ran compilation and TestNodeStatusUpdater before the push. Patch applied cleanly.

          People

          • Assignee:
            chengbing.liu Chengbing Liu
            Reporter:
            chengbing.liu Chengbing Liu
          • Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development