Description
Steps to reproduce:
1) Set the NodeManager debug delay in yarn-site.xml:
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>350</value>
</property>
2) Launch the distributed shell application multiple times:
/usr/hdp/current/hadoop-yarn-client/bin/yarn jar hadoop-yarn-applications-distributedshell-*.jar -shell_command "sleep 120" -num_containers 1 -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos/httpd-24-centos7:latest -shell_env YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true -jar hadoop-yarn-applications-distributedshell-*.jar
3) Restart the NodeManager (NM).
The NodeManager fails to start with the error below.
{code:title=NM log}
2018-03-23 21:32:14,437 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:serviceInit(181)) - ContainersMonitor enabled: true
2018-03-23 21:32:14,439 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceInit(130)) - rollingMonitorInterval is set as 3600. The logs will be aggregated every 3600 seconds
2018-03-23 21:32:14,455 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED
java.lang.NumberFormatException: For input string: ""
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Long.parseLong(Long.java:601)
	at java.lang.Long.parseLong(Long.java:631)
	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState(NMLeveldbStateStoreService.java:350)
	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState(NMLeveldbStateStoreService.java:253)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:365)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
	at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:464)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:899)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:960)
2018-03-23 21:32:14,458 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(148)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
2018-03-23 21:32:14,460 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service NodeManager failed in state INITED
java.lang.NumberFormatException: For input string: ""
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Long.parseLong(Long.java:601)
	at java.lang.Long.parseLong(Long.java:631)
	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState(NMLeveldbStateStoreService.java:350)
	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState(NMLeveldbStateStoreService.java:253)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:365)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
	at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:464)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:899)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:960)
2018-03-23 21:32:14,463 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(210)) - Stopping NodeManager metrics system...
2018-03-23 21:32:14,464 INFO impl.MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue(141)) - timeline thread interrupted.
2018-03-23 21:32:14,468 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(216)) - NodeManager metrics system stopped.
2018-03-23 21:32:14,508 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(607)) - NodeManager metrics system shutdown complete.
{code}
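
The stack trace shows Long.parseLong being called with an empty string from NMLeveldbStateStoreService.loadContainerState, which suggests that some value read back during container state recovery was empty. The standalone sketch below only reproduces that JDK behavior and shows one possible defensive pattern. The class EmptyLongParseDemo and the helper parseLongOrDefault are hypothetical names introduced for illustration; they are not part of Hadoop, and this is not the fix applied to the NodeManager.

```java
public class EmptyLongParseDemo {

    // Hypothetical helper (not a Hadoop API): returns the fallback when the
    // stored value is null or empty instead of letting Long.parseLong throw.
    static long parseLongOrDefault(String raw, long fallback) {
        if (raw == null || raw.isEmpty()) {
            return fallback;
        }
        return Long.parseLong(raw);
    }

    // Returns the message of the NumberFormatException raised by
    // Long.parseLong, or null if the input parses successfully.
    static String parseFailureMessage(String raw) {
        try {
            Long.parseLong(raw);
            return null;
        } catch (NumberFormatException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // An empty string is rejected, matching the exception type in the NM log.
        System.out.println(parseFailureMessage(""));

        // The guarded variant falls back to the supplied default instead.
        System.out.println(parseLongOrDefault("", -1L));

        // Well-formed values are parsed as usual.
        System.out.println(parseLongOrDefault("350", -1L));
    }
}
```

Whether falling back to a default is appropriate depends on what the empty value represents in the state store; skipping the entry or failing with a clearer message may be a better choice in the actual recovery code.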