Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-10059

Final states of failed-to-localize containers are not recorded in NM state store

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • nodemanager
    • None

    Description

      Currently we found an issue that many localizers of completed containers were launched and exhausted memory/cpu of that machine after NM restarted, these containers were all failed and completed when localizing on a non-existed local directory which is caused by another problem, but their final states weren't recorded in NM state store.
      The process flow of a fail-to-localize container is as follow:

      ResourceLocalizationService$LocalizerRunner#run
      -> ContainerImpl$ResourceFailedTransition#transition handle LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED
            dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES
            -> ResourceLocalizationService#handleCleanupContainerResources  handle CLEANUP_CONTAINER_RESOURCES
                dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP
                -> ContainerImpl$LocalizationFailedToDoneTransition#transition  handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
      

      There's no update for state store in this flow now, which is required to avoid unnecessary localizations after NM restarts.

      Attachments

        1. YARN-10059.001.patch
          4 kB
          Tao Yang

        Activity

          People

            Tao Yang Tao Yang
            Tao Yang Tao Yang
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: