Details
-
Sub-task
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
-
Reviewed
Description
RMTimelineCollector should have the context info of each app whose entity has been put
Attachments
Attachments
- YARN-3390.1.patch
- 73 kB
- Zhijie Shen
- YARN-3390.2.patch
- 71 kB
- Zhijie Shen
- YARN-3390.3.patch
- 71 kB
- Zhijie Shen
- YARN-3390.4.patch
- 71 kB
- Zhijie Shen
Issue Links
- blocks
-
YARN-3044 [Event producers] Implement RM writing app lifecycle events to ATS
- Resolved
- depends upon
-
MAPREDUCE-6335 convert load test driver to timeline service v.2
- Resolved
- is depended upon by
-
YARN-3044 [Event producers] Implement RM writing app lifecycle events to ATS
- Resolved
- is related to
-
YARN-3545 Investigate the concurrency issue with the map of timeline collector
- Resolved
Activity
Hi zjshen, could you please confirm that this JIRA will also block all storage layer implementations? Or we can proceed after YARN-3040 is in? Thanks!
It shouldn't. Storage layer implementations only depends on the writer interface, which is covered in YARN-3040.
Hi zjshen, junping_du & sjlee0
TimelineCollector.getTimelineEntityContext() interface will not be suitable for the RMTimelineCollector as it will be posting/putting entities for multiple apps, appattempts and containers. Hence was initially planning to modify this method to take a TimelineEntity.Identifier as a parameter and @ RMTimelineCollector planning to hold a map of TimelineEntity.Identifier to AppId and another Map to hold AppId to TimelineEntityContext (required as context is created per app when appCreatedEvent occurs).
But one more conflict which i can see is AppLevelTimelineCollector is specific for a app, so invoking getTimelineEntityContext in getTimelineEntityContext(TimelineEntities ,Ugi) is fine because all the entities which are posted can be assumed to have same context as they belong to a single app but in a general case (like RMTimelineCollector) its not guaranteed that all TimelineEntities belong to same app(i.e. TimelineEntities might have diff context). so would it be better to change the interface of TimelineCollector.putEntities) to accept the TimelineEntityContext as parameter and remove TimelineCollector.getTimelineEntityContext() method from interface ? please share your opinion...
I think it makes sense to generalize TimelineEntityContext from a single app's context to the app -> context map. Reader may need this map too. I'll fix the problem after YARN-3391 is done.
I think we need to either pass in the context per call or have a map of app id to context. I would favor the latter approach because it'd be easier on the perspective of callers of putEntities().
Thanks for the feedback zjshen & sjlee0,
either pass in the context per call or have a map of app id to context. I would favor the latter approach because it'd be easier on the perspective of callers of putEntities().
I too agree it will be easier easier on the perspective of callers of putEntities() but if we favor for map of app id to context
- implicit assumption would be that {{putEntities(TimelineEntities ) }} will be for same appId(/will have have the same context)
- TimelineEntities as such do not have appID explicitly, so planning to modify TimelineCollector.getTimelineEntityContext() to TimelineCollector.getTimelineEntityContext(TimelineEntity.Identifier id) and subclasses of TimelineCollector can take care of mapping the Id to the Context (via AppId) if required.
- code of putEntities(TimelineEntities) would look something like
Iterator<TimelineEntity> iterator = entities.getEntities().iterator(); TimelineEntity next = (iterator.hasNext())?iterator.next():null; if(null!=next) { TimelineCollectorContext context = getTimelineEntityContext(next.getIdentifier()); return writer.write(context.getClusterId(), context.getUserId(), context.getFlowId(), context.getFlowRunId(), context.getAppId(), entities); }
If its ok then shall i work on it ?
Thanks for sharing your idea, Naga! Thinking about RM side use case again, an RMTimelineCollectorManager is better than an RMTimelineCollector with a map of appId -> collector context info. RMTimelineCollectorManager is similar to what we have to NM, i.e., TimelineCollectorManager. However, it doesn't need IPC interface and web server, but maintain the collection of app-level collector an RM side. The benefit is that we can reuse most of the code of app-level collector as well as collector manager, uniforming the way we write the timeline entities.
The approach is to make TimelineCollectorManager abstract and contain the common logic of managing collector lifecycles, and have NMTimelineCollectorManager extend TimelineCollectorManager by adding the IRP interface and start the web server, and have RMTimelineCollectorManager extend TimelineCollectorManager by sourcing RMContext for the context info and coupling with RMApp to add/remove the app-level collector. I created patch for this approach.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12725163/YARN-3390.1.patch
against trunk revision b5a0b24.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7327//console
This message is automatically generated.
To help folks understand the patch, I introduce a bit more about the skeleton of the code change:
- TimelineCollectorManager: the base class of managing collectors' life cycles
- NMTimelineCollectorManager: extends TimelineCollectorManager, start web server to receive the incoming request and start the RPC channel for getting the context info publishing the service address.
- RMTimelineCollectorManager: extends TimelineCollectorManager, and source rmContext for getting the context info.
- ResourceManager: create RMTimelineCollectorManager and make it accessible inside the daemon via RMContext.
- RMAppManager: add and remove app-level collector into/out of RMTimelineCollectorManager according to RMApp's lifecycle.
After this, YARN-3044's scope is clearer, inside SystemMetricsPublisher, if we're using YTS v2, we compose the new timeline entity, pick the corresponding app-level collector from RMTimelineCollectorManager, and write through it. It's similar to what we're doing for writing DS and MR data to new timeline service.
Thanks for the patch zjshen, As earlier mentioned in Yarn-3044 if we go ahead with the notion that RM will be aware of RMCollector(manager) through RMcontext then in future if we want to segregate the dependencies b/w RM and Timeline service projects again need to do the changes in the core part (RMAppImpl and Resourcemanager), If the dependencies are fine then approach looks fine to me .
Alternative way to avoid dependency is invoking RMTimelineCollectorManager's putIfAbsent() and remove(ApplicationId appId) when SMP needs to publish publishApplicationCreatedEvent and publishApplicationFinishedEvent. In the approach what i have taken in Yarn-3044, RMTimelineCollector(can be renamed to TimelineServiceV2Publisher which will be responsible for publishing v2 events) can invoke RMTimelineCollectorManager's putIfAbsent() and remove(ApplicationId appId).
I think for the problem of dependencies, our current project structure is too coarse-grained. Ideally we may want to have something like "timeline-client" for NM/RM/AMs to depend on. For now, the plan to reuse most of the collector's code LGTM. We can put app-collector mapping into the RM collector manager, just as storage layer connections. Chopping down dependencies in the future would be much easier than removing (mostly but not completely) duplicated collector code IMHO.
Thanks Naganarasimha and zjshen! I'll take a look at the patch a little more closely later. I just wanted to bring to your attention YARN-3437. There I did a fairly similar (but not identical) refactoring of TimelineCollectorManager as part of isolating the core piece of the timeline collector manager independent of the NM-TS interaction, etc. So hopefully we can arrive at a version that can satisfy both needs.
Also, a small nit: I'm not too sure if I like the name "NMTimelineCollectorManager" (the "NM" part of it). It suggests bit too strongly that this is part of the NM. How about "NodeTimelineCollectorManager" or another bit more independent name? I can't think of many other alternatives, so any suggestion is welcome.
There I did a fairly similar (but not identical) refactoring of TimelineCollectorManager as part of isolating the core piece of the timeline collector manager independent of the NM-TS interaction, etc. So hopefully we can arrive at a version that can satisfy both needs.
Interesting! It seems that we both find most of the code of TimelineCollectorManager in different places. I have quick glance and find two difference:
1. I change appId type from string to ApplicationId, because I think the strong type, ApplicationId, is more commonly used in YARN, while I see we have trivial conversion from string to ApplicationId and then to string again.
2. For putIfAbsent and remove, I don't use template method pattern, but let the subclass override the super class method and invoke it inside the override implementation, because I'm not sure if we will need pre process or post process, and if we only invoke the process when adding a new collector. If we're sure about template, I'm okay with the template pattern too.
For putIfAbsent and remove, I don't use template method pattern, but let the subclass override the super class method and invoke it inside the override implementation, because I'm not sure if we will need pre process or post process, and if we only invoke the process when adding a new collector. If we're sure about template, I'm okay with the template pattern too.
I'm fine with either approach. The main reason I thought of that is I wanted to be clear that the base implementation of putIfAbsent() and remove() is mandatory (i.e. not optional). Since we control all of it (base and subclasses), it might not be such a big deal either way.
I took a pass at the patch, and it looks good for the most part. I would ask you to reconcile the TimelineCollectorManager changes with what I have over on YARN-3437. Again, I have a slight preference for the hook/template methods for the aforementioned reason, but it's not a strong preference one way or another.
However, I'm not sure why there is a change for RMContainerAllocator.java. It doesn't look like an intended change?
I skimmed through both patches. The two patches got significant overlap. Personally I'd incline to focus on YARN-3437 for now since it's critical to writer performance benchmark, which will further block the writer implementations. Writer implementation is on our critical path now to deliver an end-to-end preview of timeline v2. Therefore I'd prefer to give 3437 higher priority for now
For the code, are there any special reasons why ConcurrentHashMap's fine-grained locking is not sufficient for collector manager? I think it may always be attractive to use finer-grained locking if we do not need strong consistency semantics, like snapshot or iterations.
The only conflict part between YARN-3437 and this Jira is TimelineCollectorManager base class. And we happened to resort to the similar refactoring method. I'm okay to commit YARN-3437 first. However, the comments about the base TimelineCollectorManager also apply. At least, I think we should use ApplicationId instead of String.
Thanks Zhijie. I'll move forward with the existing patch for YARN-3437. You can still make the change of String => ApplicationId as part of this JIRA (as it involves more refactoring). How's that sound?
Hi zjshen, thanks for the patch! Here are some of my comments. Most of them are quite minor:
- Changes in RMContainerAllocator.java appears to be irrelevant. Seems like this is changed by an IDE by mistake (on a refactoring)?
- In the following lines
+ for (String tag : app.getApplicationTags()) { + String value = null; + if ((value = getFlowContext(TimelineUtils.FLOW_NAME_TAG_PREFIX, tag)) != null the first null assignment to value is marked as redundant + if ((value = getFlowContext(TimelineUtils.FLOW_NAME_TAG_PREFIX, tag)) != null + && !value.isEmpty()) { + collector.getTimelineEntityContext().setFlowName(value); + } else if ((value = getFlowContext(TimelineUtils.FLOW_VERSION_TAG_PREFIX, tag)) != null + && !value.isEmpty()) { + collector.getTimelineEntityContext().setFlowVersion(value); + } else if ((value = getFlowContext(TimelineUtils.FLOW_RUN_ID_TAG_PREFIX, tag)) != null + && !value.isEmpty()) { + collector.getTimelineEntityContext().setFlowRunId(Long.valueOf(value)); + }
Maybe we’d like to use a switch statement to deal with this? We may first split the tag into two parts, based on the first “:”, and then switch the first part of the returned array to set the second part of the array into flow name, version, and run id. Am I missing any fundamental obstacles for us to do this here? (String switch is available from Java 7)
- Rename MyNMTimelineCollectorManager in TestTimelineServiceClientIntegration with something indicating it's for testing?
- In the following lines:
- protected TimelineCollectorContext getTimelineEntityContext() { + public TimelineCollectorContext getTimelineEntityContext() {
We're exposing TimelineCollectorContext but we're not annotating the class. Even though we may treat unannotated classes as Audience.Private, maybe we'd like to mark it as unstable?
- In TimelineCollectorManager, I'm still having this question, although we may not want to address it in this JIRA: are there any special consistency requirements that prevent us from using ConcurrentHashMap?
- In TimelineCollectorWebService, why we're removing the utility function getCollector? I think we can reuse it when adding new web services.
In TimelineCollectorManager, I'm still having this question, although we may not want to address it in this JIRA: are there any special consistency requirements that prevent us from using ConcurrentHashMap?
I can answer this as I added that code. In putIfAbsent(), it needs to start the collector as well if get() returns null. If we used ConcurrentHashMap and removed synchronization, multiple threads could start their own collectors unnecessarily. It is probably not a show stopper but less than desirable. Also, in real life the contention on TimelineCollectorManager is low enough that synchronization should be perfectly adequate. If we want to do this without synchronization, then we would want to use something like guava's LoadingCache.
Hi sjlee0, thanks for the note! Your rational sounds quite reasonable so let keep the current design here. For now I'm OK with the coarse-grained synchronization.
Thanks for the comments. I've addressed Sangjin and Li's comments except:
maybe we'd like to mark it as unstable?
It's not the API for the users, hence it's okay to leave it unannotated.
In TimelineCollectorWebService, why we're removing the utility function getCollector?
After the refactoring, we don't need to convert appId to string. It's not necessary to wrap a single statement in a method.
In addition, I changed to use hook in TimelineCollectorManager, but postRemove is called before stopping the collector, because once the collector is stopped, the hook may not be able to do something with the stopped collector.
Moreover, I moved RMApp.stopTimelineCollector into FinalTransition. Suppose the collector only collects application lifecycle events, it doesn't need to stay after the app is finished. We can adjust it later if we find the collector needs to stay after the app is finished.
Hi zjshen, the latest patch looks good to me. Just one note is that we need to be careful when overriding the postRemove method. Since it's in a synchronized block, having a long postRemove method may significantly blocks the critical section.
I'm OK with all the rationales about the unfixed comments.
zjshen, you might want to check out junping_du's comments and my response in the other JIRA here: https://issues.apache.org/jira/browse/MAPREDUCE-6335?focusedCommentId=14508378&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14508378
I think those were small but useful changes. See this patch for the changes: https://issues.apache.org/jira/secure/attachment/12727521/YARN-3437.004.patch
It would be good to preserve those changes. Thanks!
sjlee0, would you please take a look at the new patch? I move collector.start/init out of sync block, because I think we just need to sync "put if absent" logic. And I remove the sync block for removal operation, as it seem not to be necessary given the map is already synchronized.
LGTM. Unless there are objections in the next hour or so, I will commit the patch.
I'm very sorry for revisiting this one more time after saying LGTM. But it occurred to me while looking at your patch.
You brought up a good point that collector.init() and collector.start() can be done without synchronization after put() is done (based on the collectorIsNew flag). I now realize that it is exactly the same semantics as ConcurrentHashMap, and we should be able to do this much more simply using ConcurrentHashMap. Sorry for the late change, but I propose this:
private final ConcurrentMap<ApplicationId,TimelineCollector> collectors = new ConcurrentHashMap<>(); ... public TimelineCollector putIfAbsent(ApplicationId appId, TimelineCollector collector) { TimelineCollector collectorInTable = collectors.putIfAbsent(appId, collector); if (collectorInTable == null) { LOG.info("the collector for " + appId + " was added"); collectorInTable = collector; // initialize, start, and add it to the collection so it can be // cleaned up when the parent shuts down collector.init(getConfig()); collector.start(); postPut(appId, collectorInTable); } else { LOG.info("the collector for " + appId + " already exists!"); } return collectorInTable; }
Other code remains the same. Essentially the synchronization block in the current patch collapses into a single call on CHM.putIfAbsent().
Previously, either junping_du or gtCarrera9 pointed out that there may be a race condition in terms of getting different collectors if CHM is used, but looking at this, I don't see that happening. Thoughts?
Hi sjlee0, I think the racy situation happens like this: thread 1 and 2 try to update the collector for the same appId. Thread 1 arrived first, performs the putIfAbsent with the incoming collector and succeed. Thread 2 then come and grab the collector put by thread 1. If thread 1 is blocked for some reason, thread 2 will operate on the uninitialized collector, which may cause potential problems. I'm not sure how severe this problem is, but we can avoid it by calling putIfAbsent after we set up a collector. To avoid unnecessary collector setup, we can firstly check the concurrent hash map, and only perform this operation if a collector is not present for this application id. Not sure if that makes sense...
sjlee0, gtCarrera9 and junping_du, given the discussion about the race of collector map is not settled, can we separate it in another jira, move on with v3 patch, and unblock this jira for RM writing timeline data?
Thanks gtCarrera9 for reminding me of this. You mentioned this in another JIRA but I couldn't find it.
Then v.3 does have the same issue in that collector.init() and collector.start() are done after the lock is released. Then I think we should go back to the previous version on this (collector.init() and collector.start() should be done inside the synchronization block so that no other thread can start using it before those operations are done).
Could we make that one revision? Then I think we're good to go.
Agree with sjlee0 that we may want to go back to the previous version for this (neither did I find that JIRA... ). Yes, let's address this in a separate JIRA.
It's still v.3 (as it includes a couple of other changes over v.2) but putIfAbsent() should be basically back to v.2 (LOG.error should still be LOG.info). In other words,
public TimelineCollector putIfAbsent(ApplicationId appId, TimelineCollector collector) { TimelineCollector collectorInTable = null; synchronized (collectors) { collectorInTable = collectors.get(appId); if (collectorInTable == null) { try { // initialize, start, and add it to the collection so it can be // cleaned up when the parent shuts down collector.init(getConfig()); collector.start(); collectors.put(appId, collector); LOG.info("the collector for " + appId + " was added"); collectorInTable = collector; postPut(appId, collectorInTable); } catch (Exception e) { throw new YarnRuntimeException(e); } } else { LOG.info("the collector for " + appId + " already exists!"); } } return collectorInTable; }
How's that?
The latest patch LGTM. I'll commit it shortly. Thanks for your patience!
Committed. Thanks much zjshen and Naganarasimha for working on the patch, and gtCarrera9 for your review!
sjlee0, congrats on your first commit! Thanks for review, Naga and Li!
SUCCESS: Integrated in Hadoop-trunk-Commit #10074 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10074/)
YARN-3390. Reuse TimelineCollectorManager for RM (Zhijie Shen via sjlee) (sjlee: rev 11e8905d8daf129afb6fe2e5a0eca11bcb1719c8)
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/collector/TestTimelineCollectorManager.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContextImpl.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/timelineservice/RMTimelineCollectorManager.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/AppLevelTimelineCollector.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/NodeTimelineCollectorManager.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollector.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/timelineservice/RMTimelineCollector.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/timelineservice/TestTimelineServiceClientIntegration.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/collector/TestPerNodeTimelineCollectorsAuxService.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/PerNodeTimelineCollectorsAuxService.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollectorWebService.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollectorManager.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMActiveServiceContext.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContext.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/collector/TestNMTimelineCollectorManager.java
Hi zjshen,
Shall i work on this jira ? as i can utilize the same in
YARN-3044?