Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-2816

NM fail to start with NPE during container recovery

    Details

    • Hadoop Flags:
      Reviewed

      Description

      NM fail to start with NPE during container recovery.
      We saw the following crash happen:
      2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException
      java.lang.NullPointerException
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
      at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
      at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
      at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
      at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
      at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
      at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)

      The reason is some DB files used in NMLeveldbStateStoreService are accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container record which don't have CONTAINER_REQUEST_KEY_SUFFIX(startRequest) entry in the DB. When container is recovered at ContainerManagerImpl#recoverContainer,
      The NullPointerException at the following code cause NM shutdown.

          StartContainerRequest req = rcs.getStartRequest();
          ContainerLaunchContext launchContext = req.getContainerLaunchContext();
      
      1. leveldb_records.txt
        16 kB
        zhihai xu
      2. YARN-2816.000.patch
        4 kB
        zhihai xu
      3. YARN-2816.001.patch
        4 kB
        zhihai xu
      4. YARN-2816.002.patch
        4 kB
        zhihai xu

        Issue Links

          Activity

          Hide
          zxu zhihai xu added a comment -

          I submitted a patch to review.
          The fix is to delete all these incomplete container record in NMLeveldbStateStoreService#loadContainersState. so all RecoveredContainerState is valid.
          Also I added a test in the patch, without the fix, the test will fail.

          Show
          zxu zhihai xu added a comment - I submitted a patch to review. The fix is to delete all these incomplete container record in NMLeveldbStateStoreService#loadContainersState. so all RecoveredContainerState is valid. Also I added a test in the patch, without the fix, the test will fail.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12679755/YARN-2816.000.patch
          against trunk revision 86eb27b.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5747//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5747//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679755/YARN-2816.000.patch against trunk revision 86eb27b. +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5747//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5747//console This message is automatically generated.
          Hide
          zxu zhihai xu added a comment -

          levelDB document:
          LevelDB never writes in place: it always appends to a log file, or merges existing files together to produce new ones. So an OS crash will cause a partially written log record (or a few partially written log records). LevelDB recovery code uses checksums to detect this and will skip the incomplete records.

          Based on above information, if the incomplete record is the CONTAINER_REQUEST_KEY_SUFFIX record used to store container startRequest, this issue will happen. NM can't protect OS crash. This means we must add the error handling code to avoid NM shutdown due to NPE. This justify the patch.

          Show
          zxu zhihai xu added a comment - levelDB document : LevelDB never writes in place: it always appends to a log file, or merges existing files together to produce new ones. So an OS crash will cause a partially written log record (or a few partially written log records). LevelDB recovery code uses checksums to detect this and will skip the incomplete records. Based on above information, if the incomplete record is the CONTAINER_REQUEST_KEY_SUFFIX record used to store container startRequest, this issue will happen. NM can't protect OS crash. This means we must add the error handling code to avoid NM shutdown due to NPE. This justify the patch.
          Hide
          jlowe Jason Lowe added a comment -

          This seems like a dubious use case. If something comes along and deletes (i.e.: corrupts) the leveldb database then in general the NM will not be able to recover properly. Trying to patch up one particular scenario won't cover the rest, and containers could "leak" (i.e.: be forgotten even though they're still running), container start requests lost, etc.

          As for the OS crash scenario, if the OS crashes then there's nothing left for the NM to recover. If we really want to protect against OS crashes then a much better way is to perform synchronous writes to leveldb. However this is much slower than asynchronous writes and could easily impact NM performance. Given that there's nothing to recover from the OS crash scenario, it doesn't seem worth worrying about that case.

          The real issue for the reported scenario is that the leveldb database location is a poor one for the way that system is configured, since something is coming along and corrupting the database. Either the leveldb database needs to be moved somewhere else or the file cleanup procedure needs to exclude the leveldb database.

          Show
          jlowe Jason Lowe added a comment - This seems like a dubious use case. If something comes along and deletes (i.e.: corrupts) the leveldb database then in general the NM will not be able to recover properly. Trying to patch up one particular scenario won't cover the rest, and containers could "leak" (i.e.: be forgotten even though they're still running), container start requests lost, etc. As for the OS crash scenario, if the OS crashes then there's nothing left for the NM to recover. If we really want to protect against OS crashes then a much better way is to perform synchronous writes to leveldb. However this is much slower than asynchronous writes and could easily impact NM performance. Given that there's nothing to recover from the OS crash scenario, it doesn't seem worth worrying about that case. The real issue for the reported scenario is that the leveldb database location is a poor one for the way that system is configured, since something is coming along and corrupting the database. Either the leveldb database needs to be moved somewhere else or the file cleanup procedure needs to exclude the leveldb database.
          Hide
          zxu zhihai xu added a comment -

          Hi Jason Lowe, thanks for the review,
          Based on your comment, it look like the the issue is not as serious as what I thought previously, I lowered the issue from Critical to Major.
          move the levelDB database location from /tmp directory to other safe directory is a very good suggestion.

          But I still think the patch is good for better error handling and NM recovery.

          1. If an OS crash will cause a few partially written log record in the levelDB, we can't restart the NM due to NPE.
          To restart the NM, we need manually delete all these stateStore files. I think it will be bad for the user.
          Also with the patch, we still can recover most of the containers in the NM instead of losing all these container information, which will cause all these containers to be reallocated by RM.

          2. The patch is to delete all these containers which don't have container start requests. It won't cause containers leaks. Because container start request is always the first entry to store(startContainerInternal) in the levelDB for each container records and it is always the first entry to remove (removeContainer) in the levelDB for each container records. Also when I debugged this problem, the levelDB stored most latest records in a LRU cache(memory), when the levelDB is closed, it will store these latest records from cache to the disk and our use case for levelDB is always to write key-value record and no read after NM initialization. The data record lost on disk will always the old record, so container start request record is more likely lost than other container records. This makes the patch meaningful for our use case.
          I attached a LevelDB container key record which has this issue.

          3. Missing containers record is not a problem if we can recover most containers in NM.
          When AM calls ContainerManagementProtocol#getContainerStatuses, the NM will put the missing containers in failed requests of GetContainerStatusesResponse, then the AM will notify RM to release these missing containers in ApplicationMasterProtocol#allocate(AllocateRequest.getReleaseList). So the missing containers will be recovered quickly by AM.
          With the patch, we still can recover all these containers with complete record in levelDB.
          The less containers we missed to recover, the better.

          4. The patch is small and safe, It won't cause any side effect.

          Show
          zxu zhihai xu added a comment - Hi Jason Lowe , thanks for the review, Based on your comment, it look like the the issue is not as serious as what I thought previously, I lowered the issue from Critical to Major. move the levelDB database location from /tmp directory to other safe directory is a very good suggestion. But I still think the patch is good for better error handling and NM recovery. 1. If an OS crash will cause a few partially written log record in the levelDB, we can't restart the NM due to NPE. To restart the NM, we need manually delete all these stateStore files. I think it will be bad for the user. Also with the patch, we still can recover most of the containers in the NM instead of losing all these container information, which will cause all these containers to be reallocated by RM. 2. The patch is to delete all these containers which don't have container start requests. It won't cause containers leaks. Because container start request is always the first entry to store(startContainerInternal) in the levelDB for each container records and it is always the first entry to remove (removeContainer) in the levelDB for each container records. Also when I debugged this problem, the levelDB stored most latest records in a LRU cache(memory), when the levelDB is closed, it will store these latest records from cache to the disk and our use case for levelDB is always to write key-value record and no read after NM initialization. The data record lost on disk will always the old record, so container start request record is more likely lost than other container records. This makes the patch meaningful for our use case. I attached a LevelDB container key record which has this issue. 3. Missing containers record is not a problem if we can recover most containers in NM. When AM calls ContainerManagementProtocol#getContainerStatuses, the NM will put the missing containers in failed requests of GetContainerStatusesResponse, then the AM will notify RM to release these missing containers in ApplicationMasterProtocol#allocate(AllocateRequest.getReleaseList). So the missing containers will be recovered quickly by AM. With the patch, we still can recover all these containers with complete record in levelDB. The less containers we missed to recover, the better. 4. The patch is small and safe, It won't cause any side effect.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12679963/leveldb_records.txt
          against trunk revision 75b820c.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5758//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12679963/leveldb_records.txt against trunk revision 75b820c. -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5758//console This message is automatically generated.
          Hide
          jlowe Jason Lowe added a comment -

          It won't cause containers leaks. Because container start request is always the first entry to store(startContainerInternal) in the levelDB for each container records and it is always the first entry to remove (removeContainer) in the levelDB for each container records.

          I don't understand this statement. If the start container request is the first record to be lost, what happens if we write the start container request, launch (or maybe don't launch) the container, then restart? If we lost the container start record but the container had not completed before the restart, didn't we just lose track of it upon recovery?

          Anyway this shouldn't make things much worse than the NM failing to start up if this specific instance of database corruption occurs. I just think we need to realize that there are many other ways the database could be corrupted and this only works around a very specific instance of it. Comments on the patch:

          +      LOG.info("Remove container " + containerId +
          +          " with incomplete records");
          

          The above needs to be logged at least at the warn level if not error. We have very likely leaked a container. Also the code should do much more than just forget the container and instead look for the pid file, try to kill it if found, and return a recovered container status of killed/lost or something similar. We shouldn't just pretend the container didn't exist when returning recovered containers.

          -        LOG.info("Creating state database at " + dbfile);
          +        LOG.info("Creating state database at " + dbfile, e);
          

          Why was this change made? I don't see the point of logging the exception showing the database didn't exist when we already checked for that condition in this code path.

          Show
          jlowe Jason Lowe added a comment - It won't cause containers leaks. Because container start request is always the first entry to store(startContainerInternal) in the levelDB for each container records and it is always the first entry to remove (removeContainer) in the levelDB for each container records. I don't understand this statement. If the start container request is the first record to be lost, what happens if we write the start container request, launch (or maybe don't launch) the container, then restart? If we lost the container start record but the container had not completed before the restart, didn't we just lose track of it upon recovery? Anyway this shouldn't make things much worse than the NM failing to start up if this specific instance of database corruption occurs. I just think we need to realize that there are many other ways the database could be corrupted and this only works around a very specific instance of it. Comments on the patch: + LOG.info("Remove container " + containerId + + " with incomplete records"); The above needs to be logged at least at the warn level if not error. We have very likely leaked a container. Also the code should do much more than just forget the container and instead look for the pid file, try to kill it if found, and return a recovered container status of killed/lost or something similar. We shouldn't just pretend the container didn't exist when returning recovered containers. - LOG.info("Creating state database at " + dbfile); + LOG.info("Creating state database at " + dbfile, e); Why was this change made? I don't see the point of logging the exception showing the database didn't exist when we already checked for that condition in this code path.
          Hide
          zxu zhihai xu added a comment -

          Hi Jason Lowe,

          thanks for the review.
          Sorry I misunderstands the containers leaks you talk about. My containers leaks means if the container has container start record, it must have complete records, which is for my error case in the attached levelDB files. So if we remove all these containers without container start record, the remaining containers in levelDB will have complete records.

          I fully agree to your comment. I attached a new patch which addressed all the comments except one:

          We have very likely leaked a container. Also the code should do much more than just forget the container and instead look for the pid file, try to kill it if found, and return a recovered container status of killed/lost or something similar.

          I added todo comment in the code for this.
          // TODO: kill and cleanup the leaked container
          I will create a separate JIRA to address this:
          Because it need do a lot of stuff and also it can be shared by all other container leakage cases in the NM:
          For example, when AM asks a container status which doesn't exist in the NM containers' list, which most likely is a leaked container.

          It is a great pleasure to discuss the issue with you, I learned a lot from your comment.

          Show
          zxu zhihai xu added a comment - Hi Jason Lowe , thanks for the review. Sorry I misunderstands the containers leaks you talk about. My containers leaks means if the container has container start record, it must have complete records, which is for my error case in the attached levelDB files. So if we remove all these containers without container start record, the remaining containers in levelDB will have complete records. I fully agree to your comment. I attached a new patch which addressed all the comments except one: We have very likely leaked a container. Also the code should do much more than just forget the container and instead look for the pid file, try to kill it if found, and return a recovered container status of killed/lost or something similar. I added todo comment in the code for this. // TODO: kill and cleanup the leaked container I will create a separate JIRA to address this: Because it need do a lot of stuff and also it can be shared by all other container leakage cases in the NM: For example, when AM asks a container status which doesn't exist in the NM containers' list, which most likely is a leaked container. It is a great pleasure to discuss the issue with you, I learned a lot from your comment.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12680250/YARN-2816.001.patch
          against trunk revision a71e930.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 javac. The applied patch generated 1396 javac compiler warnings (more than the trunk's current 1220 warnings).

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5790//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5790//artifact/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5790//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680250/YARN-2816.001.patch against trunk revision a71e930. +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javac . The applied patch generated 1396 javac compiler warnings (more than the trunk's current 1220 warnings). +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5790//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5790//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5790//console This message is automatically generated.
          Hide
          zxu zhihai xu added a comment -

          Attached a new patch ../YARN-2816.002.patch to fix javac compiler warnings.
          also this patch has a test case.

          Show
          zxu zhihai xu added a comment - Attached a new patch ../ YARN-2816 .002.patch to fix javac compiler warnings. also this patch has a test case.
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12681235/YARN-2816.002.patch
          against trunk revision 9f0319b.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5834//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5834//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681235/YARN-2816.002.patch against trunk revision 9f0319b. +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5834//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5834//console This message is automatically generated.
          Hide
          zxu zhihai xu added a comment -

          Hi Jason Lowe,
          Are you ok with the new patch?
          thanks
          zhihai

          Show
          zxu zhihai xu added a comment - Hi Jason Lowe , Are you ok with the new patch? thanks zhihai
          Hide
          jlowe Jason Lowe added a comment -

          +1 lgtm. Committing this.

          Show
          jlowe Jason Lowe added a comment - +1 lgtm. Committing this.
          Hide
          jlowe Jason Lowe added a comment -

          Thanks, zhihai xu! I committed this to trunk and branch-2.

          Show
          jlowe Jason Lowe added a comment - Thanks, zhihai xu ! I committed this to trunk and branch-2.
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #6549 (See https://builds.apache.org/job/Hadoop-trunk-Commit/6549/)
          YARN-2816. NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #6549 (See https://builds.apache.org/job/Hadoop-trunk-Commit/6549/ ) YARN-2816 . NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02) hadoop-yarn-project/CHANGES.txt hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java
          Hide
          zxu zhihai xu added a comment -

          thanks Jason Lowe for the review and commit.

          Show
          zxu zhihai xu added a comment - thanks Jason Lowe for the review and commit.
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #744 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/744/)
          YARN-2816. NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Yarn-trunk #744 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/744/ ) YARN-2816 . NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02) hadoop-yarn-project/CHANGES.txt hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #6 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/6/)
          YARN-2816. NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
          • hadoop-yarn-project/CHANGES.txt
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #6 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/6/ ) YARN-2816 . NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java hadoop-yarn-project/CHANGES.txt
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Hdfs-trunk #1934 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1934/)
          YARN-2816. NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
          • hadoop-yarn-project/CHANGES.txt
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Hdfs-trunk #1934 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1934/ ) YARN-2816 . NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java hadoop-yarn-project/CHANGES.txt
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #6 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/6/)
          YARN-2816. NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #6 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/6/ ) YARN-2816 . NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02) hadoop-yarn-project/CHANGES.txt hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1958 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1958/)
          YARN-2816. NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk #1958 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1958/ ) YARN-2816 . NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02) hadoop-yarn-project/CHANGES.txt hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #6 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/6/)
          YARN-2816. NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #6 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/6/ ) YARN-2816 . NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java hadoop-yarn-project/CHANGES.txt hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java
          Hide
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Pulled this into 2.6.1. Ran compilation and TestNMLeveldbStateStoreService before the push. Patch applied cleanly.

          Show
          vinodkv Vinod Kumar Vavilapalli added a comment - Pulled this into 2.6.1. Ran compilation and TestNMLeveldbStateStoreService before the push. Patch applied cleanly.

            People

            • Assignee:
              zxu zhihai xu
              Reporter:
              zxu zhihai xu
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development