  1. Hadoop YARN
  2. YARN-10059

Final states of failed-to-localize containers are not recorded in NM state store

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels:
      None
    • Target Version/s:

      Description

      We found that after an NM restart, many localizers were launched for containers that had already completed, exhausting the machine's memory and CPU. These containers had all failed while localizing on a nonexistent local directory (caused by a separate problem) and then completed, but their final states were not recorded in the NM state store.
      The process flow of a failed-to-localize container is as follows:

      ResourceLocalizationService$LocalizerRunner#run
      -> ContainerImpl$ResourceFailedTransition#transition handle LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED
            dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES
            -> ResourceLocalizationService#handleCleanupContainerResources  handle CLEANUP_CONTAINER_RESOURCES
                dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP
                -> ContainerImpl$LocalizationFailedToDoneTransition#transition  handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
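
To make the flow above easier to follow, here is a minimal, self-contained sketch that replays the same sequence of events on a toy state machine. It is not the NodeManager code: the class `FailedLocalizationFlow` and its single-threaded queue are assumptions made for illustration, and only the state and event names are taken from the flow above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Simplified model of the failed-localization flow; hypothetical, not the NodeManager classes. */
class FailedLocalizationFlow {
    enum State { LOCALIZING, LOCALIZATION_FAILED, DONE }
    enum Event { RESOURCE_FAILED, CLEANUP_CONTAINER_RESOURCES, CONTAINER_RESOURCES_CLEANEDUP }

    State state = State.LOCALIZING;
    final List<String> transitions = new ArrayList<>();
    private final Deque<Event> queue = new ArrayDeque<>();

    /** Queues an event and drains the queue, roughly as a single-threaded dispatcher would. */
    void dispatch(Event first) {
        queue.add(first);
        while (!queue.isEmpty()) {
            handle(queue.poll());
        }
    }

    private void handle(Event event) {
        switch (event) {
            case RESOURCE_FAILED:
                // Container side: LOCALIZING -> LOCALIZATION_FAILED, then ask for cleanup.
                if (state == State.LOCALIZING) {
                    move(State.LOCALIZATION_FAILED);
                    queue.add(Event.CLEANUP_CONTAINER_RESOURCES);
                }
                break;
            case CLEANUP_CONTAINER_RESOURCES:
                // Localization-service side: clean up, then report back to the container.
                queue.add(Event.CONTAINER_RESOURCES_CLEANEDUP);
                break;
            case CONTAINER_RESOURCES_CLEANEDUP:
                // Container side: LOCALIZATION_FAILED -> DONE. Nothing is persisted here.
                if (state == State.LOCALIZATION_FAILED) {
                    move(State.DONE);
                }
                break;
        }
    }

    private void move(State next) {
        transitions.add(state + " -> " + next);
        state = next;
    }
}
```

Dispatching `RESOURCE_FAILED` on a fresh instance walks the container through both transitions and leaves it in `DONE`, without any step that writes to a store.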
      

      This flow currently never updates the state store. Such an update is required to avoid unnecessary localizations after the NM restarts.
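
The effect of the missing update can be illustrated with a small sketch. Everything here is an assumption for illustration: `SketchStateStore`, `SketchTransitions`, their methods, and the exit code are hypothetical stand-ins, not the real state store API or the contents of the attached patch. The sketch only shows why a container whose completion is never persisted still looks unfinished after a restart.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the NM state store. */
class SketchStateStore {
    private final Set<String> started = new HashSet<>();
    private final Map<String, Integer> completed = new HashMap<>();

    void storeContainer(String containerId) {
        started.add(containerId);
    }

    void storeContainerCompleted(String containerId, int exitCode) {
        completed.put(containerId, exitCode);
    }

    /** Containers that a restart would treat as unfinished and try to localize again. */
    Set<String> containersToRelocalize() {
        Set<String> pending = new HashSet<>(started);
        pending.removeAll(completed.keySet());
        return pending;
    }
}

/** Hypothetical model of the LOCALIZATION_FAILED -> DONE transition. */
class SketchTransitions {
    // Placeholder exit code; the value the NodeManager actually uses is not given in this report.
    static final int ASSUMED_FAILURE_EXIT_CODE = -1;

    static void localizationFailedToDone(SketchStateStore store,
                                         String containerId,
                                         boolean persistFinalState) {
        if (persistFinalState) {
            // The step this report says is missing: record the final state.
            store.storeContainerCompleted(containerId, ASSUMED_FAILURE_EXIT_CODE);
        }
        // In either case the in-memory container would now be DONE.
    }
}
```

In this model, a container finished with `persistFinalState` set to `false` remains in `containersToRelocalize()`, while one finished with `true` does not, which corresponds to the unnecessary localizers described above.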

        Attachments

        1. YARN-10059.001.patch
          4 kB
          Tao Yang

            People

            • Assignee:
              Tao Yang
              Reporter:
              Tao Yang
            • Votes:
              0
              Watchers:
              3