Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
We found that, after an NM restart, localizers were launched for many already-completed containers and exhausted the memory and CPU of the machine. These containers had all failed and completed while localizing to a non-existent local directory (which is caused by a separate problem), but their final states were not recorded in the NM state store.
The process flow of a container that fails to localize is as follows:
1. ResourceLocalizationService$LocalizerRunner#run
2. ContainerImpl$ResourceFailedTransition#transition: handles LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED, then dispatches LocalizationEventType.CLEANUP_CONTAINER_RESOURCES
3. ResourceLocalizationService#handleCleanupContainerResources: handles CLEANUP_CONTAINER_RESOURCES, then dispatches ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP
4. ContainerImpl$LocalizationFailedToDoneTransition#transition: handles LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
This flow currently does not update the state store. Such an update is required to avoid unnecessary localizations after the NM restarts.
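To illustrate the gap, here is a minimal, self-contained sketch of the failure path with and without a state store update. It is not the NodeManager code: LocalizationFailureSketch, InMemoryStateStore, Container, the persist flag, wouldRelocalizeAfterRestart and the exit code -1000 are all hypothetical names and values chosen for illustration. The recovery rule it uses (a container is localized again unless its completion was recorded) is an assumption made to model the behavior described above.

```java
import java.util.HashMap;
import java.util.Map;

public class LocalizationFailureSketch {

    enum ContainerState { LOCALIZING, LOCALIZATION_FAILED, DONE }

    // Hypothetical in-memory stand-in for the NM state store.
    static class InMemoryStateStore {
        private final Map<String, Integer> completed = new HashMap<>();

        void storeContainerCompleted(String containerId, int exitCode) {
            completed.put(containerId, exitCode);
        }

        boolean isCompleted(String containerId) {
            return completed.containsKey(containerId);
        }
    }

    // Hypothetical, heavily simplified container with only the failure path.
    static class Container {
        final String id;
        final InMemoryStateStore store;
        ContainerState state = ContainerState.LOCALIZING;

        Container(String id, InMemoryStateStore store) {
            this.id = id;
            this.store = store;
        }

        // LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED
        void onResourceFailed() {
            if (state != ContainerState.LOCALIZING) {
                throw new IllegalStateException("unexpected state " + state);
            }
            state = ContainerState.LOCALIZATION_FAILED;
        }

        // LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP.
        // persist == false models the current flow (no state store update);
        // persist == true models the proposed behavior.
        void onResourcesCleanedUp(int exitCode, boolean persist) {
            if (state != ContainerState.LOCALIZATION_FAILED) {
                throw new IllegalStateException("unexpected state " + state);
            }
            if (persist) {
                store.storeContainerCompleted(id, exitCode);
            }
            state = ContainerState.DONE;
        }
    }

    // Assumed recovery rule: after a restart, any container whose completion
    // was not recorded is treated as still needing localization.
    static boolean wouldRelocalizeAfterRestart(InMemoryStateStore store, String id) {
        return !store.isCompleted(id);
    }

    public static void main(String[] args) {
        InMemoryStateStore store = new InMemoryStateStore();

        Container recorded = new Container("container_1", store);
        recorded.onResourceFailed();
        recorded.onResourcesCleanedUp(-1000, true);   // final state persisted

        Container unrecorded = new Container("container_2", store);
        unrecorded.onResourceFailed();
        unrecorded.onResourcesCleanedUp(-1000, false); // final state lost

        System.out.println("container_1 relocalized after restart: "
                + wouldRelocalizeAfterRestart(store, recorded.id));
        System.out.println("container_2 relocalized after restart: "
                + wouldRelocalizeAfterRestart(store, unrecorded.id));
    }
}
```

In this sketch both containers reach DONE in memory, but only the one whose completion was written to the store is skipped after a simulated restart. The other would get a new localizer, which is the behavior reported in this issue.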