[YARN-10059] Final states of failed-to-localize containers are not recorded in NM state store - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: nodemanager
Labels: None
Target Version/s: 3.5.0

Description

After an NM restart, we found that many localizers were launched for containers that had already completed, exhausting the memory and CPU of the machine. These containers had all failed during localization because they targeted a non-existent local directory (caused by a separate problem), and they had reached a final state, but their final states were not recorded in the NM state store.
The process flow of a failed-to-localize container is as follows:

ResourceLocalizationService$LocalizerRunner#run
-> ContainerImpl$ResourceFailedTransition#transition handle LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED
      dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES
      -> ResourceLocalizationService#handleCleanupContainerResources  handle CLEANUP_CONTAINER_RESOURCES
          dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP
          -> ContainerImpl$LocalizationFailedToDoneTransition#transition  handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP

Nothing in this flow currently updates the state store. Such an update is required to avoid unnecessary localizations after the NM restarts.
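
The effect of the missing update can be illustrated with a small self-contained sketch. This is a toy model, not the NodeManager code: ToyStateStore, ToyContainer, recordCompleted, containersToRecover and the persistFinalState flag are all hypothetical names introduced here for illustration, and recovery is assumed to re-localize every container whose completion was never recorded.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LocalizationFailureSketch {

    // Only the container states that appear in the flow above.
    enum State { LOCALIZING, LOCALIZATION_FAILED, DONE }

    // Hypothetical stand-in for the NM state store (not the real API).
    static class ToyStateStore {
        private final Set<String> known = new LinkedHashSet<>();
        private final Set<String> completed = new HashSet<>();

        void recordStarted(String id) { known.add(id); }

        void recordCompleted(String id) { completed.add(id); }

        // Assumption: after a restart, every container without a recorded
        // completion is recovered and localized again.
        List<String> containersToRecover() {
            List<String> result = new ArrayList<>();
            for (String id : known) {
                if (!completed.contains(id)) {
                    result.add(id);
                }
            }
            return result;
        }
    }

    // Hypothetical container that follows the failed-localization path.
    static class ToyContainer {
        private final String id;
        private final ToyStateStore store;
        private final boolean persistFinalState;
        private State state = State.LOCALIZING;

        ToyContainer(String id, ToyStateStore store, boolean persistFinalState) {
            this.id = id;
            this.store = store;
            this.persistFinalState = persistFinalState;
            store.recordStarted(id);
        }

        // LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED
        void onResourceFailed() {
            if (state == State.LOCALIZING) {
                state = State.LOCALIZATION_FAILED;
            }
        }

        // LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
        void onResourcesCleanedUp() {
            if (state != State.LOCALIZATION_FAILED) {
                return;
            }
            state = State.DONE;
            if (persistFinalState) {
                // The step the report says is missing from the flow.
                store.recordCompleted(id);
            }
        }

        State getState() { return state; }
    }

    // Runs one container through the failure path, then returns what a
    // restarted NM would try to recover from the store.
    static List<String> runAndRestart(boolean persistFinalState) {
        ToyStateStore store = new ToyStateStore();
        ToyContainer container = new ToyContainer("container_1", store, persistFinalState);
        container.onResourceFailed();
        container.onResourcesCleanedUp();
        return store.containersToRecover();
    }

    public static void main(String[] args) {
        System.out.println("without store update: " + runAndRestart(false));
        System.out.println("with store update:    " + runAndRestart(true));
    }
}
```

In this model the container reaches DONE in both runs; only the run that records the final state leaves nothing for the restarted NM to recover.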

Attachments

YARN-10059.001.patch
25/Dec/19 01:33
4 kB
Tao Yang

People

Assignee: Tao Yang

Reporter: Tao Yang

Votes: 0

Watchers: 4

Dates

Created: 24/Dec/19 07:03

Updated: 04/Jan/24 08:33