Description
When the RM publishes container events, i.e. when yarn.rm.system-metrics-publisher.emit-container-events is enabled, there is a race condition between the processing of container events and the appFinished event. The appFinished event removes the appId from the collector list, so any container event processed after it causes an NPE.
In the trace below, the appId is removed from the collectors first, and only then are the corresponding container events processed.
2017-06-06 19:28:48,896 INFO capacity.ParentQueue (ParentQueue.java:removeApplication(472)) - Application removed - appId: application_1496758895643_0005 user: root leaf-queue of parent: root #applications: 0
2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager (TimelineCollectorManager.java:remove(190)) - The collector service for application_1496758895643_0005 was removed
2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing entity TimelineEntity[type='YARN_CONTAINER', id='container_e01_1496758895643_0005_01_000002']
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
    at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
    at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
    at java.lang.Thread.run(Thread.java:745)
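The ordering in the trace can be reproduced in isolation. The following is a minimal sketch, not the actual RM code and not necessarily the fix applied for this issue. The class CollectorRaceSketch, the Collector type, the collectors map, and both putEntity variants are hypothetical stand-ins for the collector lookup in TimelineServiceV2Publisher.putEntity. It shows why a lookup after removal throws an NPE, and one possible mitigation: a null check that drops the late event.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CollectorRaceSketch {
    /** Hypothetical stand-in for a per-application timeline collector. */
    static final class Collector {
        int published = 0;

        void putEntity(String entityId) {
            published++;
        }
    }

    /** Hypothetical stand-in for the appId -> collector map kept by the collector manager. */
    static final Map<String, Collector> collectors = new ConcurrentHashMap<>();

    /** Mirrors the failing pattern: dereferences the lookup result without checking it. */
    static void putEntityUnguarded(String appId, String entityId) {
        Collector collector = collectors.get(appId);
        collector.putEntity(entityId); // throws NullPointerException if the app was already removed
    }

    /** Assumed mitigation: skip the event when the collector is already gone. */
    static boolean putEntityGuarded(String appId, String entityId) {
        Collector collector = collectors.get(appId);
        if (collector == null) {
            return false; // app finished first; the late container event is dropped
        }
        collector.putEntity(entityId);
        return true;
    }

    public static void main(String[] args) {
        String appId = "application_1496758895643_0005";
        collectors.put(appId, new Collector());

        // The appFinished handling wins the race and removes the collector.
        collectors.remove(appId);

        // The container event is processed afterwards.
        boolean published = putEntityGuarded(appId, "container_e01_1496758895643_0005_01_000002");
        System.out.println("guarded publish succeeded: " + published);

        try {
            putEntityUnguarded(appId, "container_e01_1496758895643_0005_01_000002");
        } catch (NullPointerException e) {
            System.out.println("unguarded publish threw NullPointerException");
        }
    }
}
```

Dropping the event avoids the NPE but loses the container entity. Whether that is acceptable, or whether removal of the collector should instead be deferred until pending events are drained, is a design decision this sketch does not settle.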
Attachments
Issue Links
- duplicates
  - YARN-7539 NullPointerException in timeline service v2 (Resolved)
- is depended upon by
  - YARN-9185 TimelineServiceV2Publisher throws NPE when app is finished before container metrics updated (Resolved)
- is duplicated by
  - YARN-9447 RM Crashes with NPE at TimelineServiceV2Publisher.putEntity (Resolved)
  - YARN-9215 RM throws NPE and shutdown when trying to stop a service (Resolved)
- is related to
  - YARN-7835 [Atsv2] Race condition in NM while publishing events if second attempt is launched on the same node (Resolved)
- relates to
  - YARN-8130 Race condition when container events are published for KILLED applications (Resolved)