  1. Hadoop YARN
  2. YARN-7055 YARN Timeline Service v.2: beta 1
  3. YARN-8130

Race condition when container events are published for KILLED applications

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.10.0, 3.2.0, 3.1.1, 3.0.3
    • Component/s: ATSv2
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      There appears to be a race condition between an application being KILLED and the publishing of its container event information. A YARN_CONTAINER_FINISHED event is generated for each completed container, but for some containers of a KILLED application this event is missing. A NodeManager log snippet follows:

      2018-04-09 08:44:54,474 INFO  shuffle.ExternalShuffleBlockResolver (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application application_1523259757659_0003 removed, cleanupLocalDirs = false
      2018-04-09 08:44:54,478 INFO  application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_1523259757659_0003 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
      2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been removed before the entity could be published for TimelineEntity[type='YARN_CONTAINER', id='container_1523259757659_0003_01_000002']
      2018-04-09 08:44:54,478 INFO  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just finished : application_1523259757659_0003
      2018-04-09 08:44:54,488 INFO  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs for container container_1523259757659_0003_01_000001. Current good log dirs are /grid/0/hadoop/yarn/log
      2018-04-09 08:44:54,492 INFO  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs for container container_1523259757659_0003_01_000002. Current good log dirs are /grid/0/hadoop/yarn/log
      2018-04-09 08:44:55,470 INFO  collector.TimelineCollectorManager (TimelineCollectorManager.java:remove(192)) - The collector service for application_1523259757659_0003 was removed
      2018-04-09 08:44:55,472 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:handle(1572)) - couldn't find application application_1523259757659_0003 while processing FINISH_APPS event. The ResourceManager allocated resources for this application to the NodeManager but no active containers were found to process

      The container ID in the ERROR line of the log, container_1523259757659_0003_01_000002, is the one whose finished event is missing.
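
      The ordering problem suggested by the log can be replayed deterministically with a small model. This is a minimal sketch under stated assumptions, not NodeManager code and not the change made in the attached patches: the class PublishRaceSketch and its methods are hypothetical, and a plain map stands in for the per-application timeline clients. It only shows that an entity published after the application's client has been removed is dropped, which would match the "client has been removed before the entity could be published" error.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of per-application timeline clients; not NodeManager code.
class PublishRaceSketch {
    // appId -> entities accepted by that application's (simulated) client
    private final Map<String, List<String>> clients = new HashMap<>();
    // entities that arrived after their application's client was removed
    private final List<String> dropped = new ArrayList<>();

    void appStarted(String appId) {
        clients.put(appId, new ArrayList<>());
    }

    // Returns true if the entity was accepted, false if the client was already gone.
    boolean publish(String appId, String entity) {
        List<String> accepted = clients.get(appId);
        if (accepted == null) {
            dropped.add(entity);
            return false;
        }
        accepted.add(entity);
        return true;
    }

    // Removes the client, modelling the application reaching FINISHED.
    List<String> appFinished(String appId) {
        return clients.remove(appId);
    }

    List<String> dropped() {
        return dropped;
    }

    public static void main(String[] args) {
        String app = "application_1523259757659_0003";
        String container = "container_1523259757659_0003_01_000002";

        // Ordering 1: the container event is published before the client is removed.
        PublishRaceSketch good = new PublishRaceSketch();
        good.appStarted(app);
        System.out.println("publish before removal accepted: " + good.publish(app, container));
        good.appFinished(app);

        // Ordering 2: the client is removed first, so the event is lost.
        PublishRaceSketch bad = new PublishRaceSketch();
        bad.appStarted(app);
        bad.appFinished(app);
        System.out.println("publish after removal accepted: " + bad.publish(app, container));
        System.out.println("dropped: " + bad.dropped());
    }
}
```

      In this model the outcome depends only on whether appFinished runs before or after publish. In a real NodeManager the two steps would be driven by separate events, which is why the finished event would go missing for only some containers.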

        Attachments

        1. YARN-8130.01.patch
          5 kB
          Rohith Sharma K S
        2. YARN-8130.02.patch
          12 kB
          Rohith Sharma K S
        3. YARN-8130.03.patch
          13 kB
          Rohith Sharma K S

              People

              • Assignee:
                rohithsharma Rohith Sharma K S
                Reporter:
                charanh Charan Hebri