Details
-
Improvement
-
Status: Patch Available
-
Major
-
Resolution: Unresolved
-
2.7.1
-
None
Description
Recently, I found a issue on nodemanager metrics.That's NodeManagerMetrics#containersLaunched is not actually means the container succeed launched times.Because in some time, it will be failed when receiving the killing command or happening container-localizationFailed.This will lead to a failed container.But now,this counter value will be increased in these code whenever the container is started successfully or failed.
Credentials credentials = parseCredentials(launchContext); Container container = new ContainerImpl(getConfig(), this.dispatcher, context.getNMStateStore(), launchContext, credentials, metrics, containerTokenIdentifier); ApplicationId applicationID = containerId.getApplicationAttemptId().getApplicationId(); if (context.getContainers().putIfAbsent(containerId, container) != null) { NMAuditLogger.logFailure(user, AuditConstants.START_CONTAINER, "ContainerManagerImpl", "Container already running on this node!", applicationID, containerId); throw RPCUtil.getRemoteException("Container " + containerIdStr + " already is running on this node!!"); } this.readLock.lock(); try { if (!serviceStopped) { // Create the application Application application = new ApplicationImpl(dispatcher, user, applicationID, credentials, context); if (null == context.getApplications().putIfAbsent(applicationID, application)) { LOG.info("Creating a new application reference for app " + applicationID); LogAggregationContext logAggregationContext = containerTokenIdentifier.getLogAggregationContext(); Map<ApplicationAccessType, String> appAcls = container.getLaunchContext().getApplicationACLs(); context.getNMStateStore().storeApplication(applicationID, buildAppProto(applicationID, user, credentials, appAcls, logAggregationContext)); dispatcher.getEventHandler().handle( new ApplicationInitEvent(applicationID, appAcls, logAggregationContext)); } this.context.getNMStateStore().storeContainer(containerId, request); dispatcher.getEventHandler().handle( new ApplicationContainerInitEvent(container)); this.context.getContainerTokenSecretManager().startContainerSuccessful( containerTokenIdentifier); NMAuditLogger.logSuccess(user, AuditConstants.START_CONTAINER, "ContainerManageImpl", applicationID, containerId); // TODO launchedContainer misplaced -> doesn't necessarily mean a container // launch. A finished Application will not launch containers. metrics.launchedContainer(); metrics.allocateContainer(containerTokenIdentifier.getResource()); } else { throw new YarnException( "Container start failed as the NodeManager is " + "in the process of shutting down"); }
In addition, we are lack of localzationFailed metric in container.