Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.9.0, 3.0.0-alpha1
    • timelineserver
    • None
    • Reviewed

    Description

      RMTimelineCollector should have the context info of each app whose entity has been put

      Attachments

        1. YARN-3390.1.patch
          73 kB
          Zhijie Shen
        2. YARN-3390.2.patch
          71 kB
          Zhijie Shen
        3. YARN-3390.3.patch
          71 kB
          Zhijie Shen
        4. YARN-3390.4.patch
          71 kB
          Zhijie Shen

        Issue Links

          Activity

            Hi zjshen,
            Shall i work on this jira ? as i can utilize the same in YARN-3044 ?

            Naganarasimha Naganarasimha G R added a comment - Hi zjshen , Shall i work on this jira ? as i can utilize the same in YARN-3044 ?
            gtcarrera9 Li Lu added a comment -

            Hi zjshen, could you please confirm that this JIRA will also block all storage layer implementations? Or we can proceed after YARN-3040 is in? Thanks!

            gtcarrera9 Li Lu added a comment - Hi zjshen , could you please confirm that this JIRA will also block all storage layer implementations? Or we can proceed after YARN-3040 is in? Thanks!
            zjshen Zhijie Shen added a comment -

            It shouldn't. Storage layer implementations only depends on the writer interface, which is covered in YARN-3040.

            zjshen Zhijie Shen added a comment - It shouldn't. Storage layer implementations only depends on the writer interface, which is covered in YARN-3040 .
            Naganarasimha Naganarasimha G R added a comment - - edited

            Hi zjshen, junping_du & sjlee0
            TimelineCollector.getTimelineEntityContext() interface will not be suitable for the RMTimelineCollector as it will be posting/putting entities for multiple apps, appattempts and containers. Hence was initially planning to modify this method to take a TimelineEntity.Identifier as a parameter and @ RMTimelineCollector planning to hold a map of TimelineEntity.Identifier to AppId and another Map to hold AppId to TimelineEntityContext (required as context is created per app when appCreatedEvent occurs).
            But one more conflict which i can see is AppLevelTimelineCollector is specific for a app, so invoking getTimelineEntityContext in getTimelineEntityContext(TimelineEntities ,Ugi) is fine because all the entities which are posted can be assumed to have same context as they belong to a single app but in a general case (like RMTimelineCollector) its not guaranteed that all TimelineEntities belong to same app(i.e. TimelineEntities might have diff context). so would it be better to change the interface of TimelineCollector.putEntities) to accept the TimelineEntityContext as parameter and remove TimelineCollector.getTimelineEntityContext() method from interface ? please share your opinion...

            Naganarasimha Naganarasimha G R added a comment - - edited Hi zjshen , junping_du & sjlee0 TimelineCollector.getTimelineEntityContext() interface will not be suitable for the RMTimelineCollector as it will be posting/putting entities for multiple apps, appattempts and containers. Hence was initially planning to modify this method to take a TimelineEntity.Identifier as a parameter and @ RMTimelineCollector planning to hold a map of TimelineEntity.Identifier to AppId and another Map to hold AppId to TimelineEntityContext (required as context is created per app when appCreatedEvent occurs). But one more conflict which i can see is AppLevelTimelineCollector is specific for a app, so invoking getTimelineEntityContext in getTimelineEntityContext(TimelineEntities ,Ugi) is fine because all the entities which are posted can be assumed to have same context as they belong to a single app but in a general case (like RMTimelineCollector) its not guaranteed that all TimelineEntities belong to same app(i.e. TimelineEntities might have diff context). so would it be better to change the interface of TimelineCollector.putEntities) to accept the TimelineEntityContext as parameter and remove TimelineCollector.getTimelineEntityContext() method from interface ? please share your opinion...
            zjshen Zhijie Shen added a comment -

            I think it makes sense to generalize TimelineEntityContext from a single app's context to the app -> context map. Reader may need this map too. I'll fix the problem after YARN-3391 is done.

            zjshen Zhijie Shen added a comment - I think it makes sense to generalize TimelineEntityContext from a single app's context to the app -> context map. Reader may need this map too. I'll fix the problem after YARN-3391 is done.
            sjlee0 Sangjin Lee added a comment -

            I think we need to either pass in the context per call or have a map of app id to context. I would favor the latter approach because it'd be easier on the perspective of callers of putEntities().

            sjlee0 Sangjin Lee added a comment - I think we need to either pass in the context per call or have a map of app id to context. I would favor the latter approach because it'd be easier on the perspective of callers of putEntities().
            zjshen Zhijie Shen added a comment -

            I would favor the latter approach

            +1

            zjshen Zhijie Shen added a comment - I would favor the latter approach +1

            Thanks for the feedback zjshen & sjlee0,

            either pass in the context per call or have a map of app id to context. I would favor the latter approach because it'd be easier on the perspective of callers of putEntities().

            I too agree it will be easier easier on the perspective of callers of putEntities() but if we favor for map of app id to context

            • implicit assumption would be that {{putEntities(TimelineEntities ) }} will be for same appId(/will have have the same context)
            • TimelineEntities as such do not have appID explicitly, so planning to modify TimelineCollector.getTimelineEntityContext() to TimelineCollector.getTimelineEntityContext(TimelineEntity.Identifier id) and subclasses of TimelineCollector can take care of mapping the Id to the Context (via AppId) if required.
            • code of putEntities(TimelineEntities) would look something like
                  Iterator<TimelineEntity> iterator = entities.getEntities().iterator();
                  TimelineEntity next = (iterator.hasNext())?iterator.next():null;
                  if(null!=next) {
                    
                  TimelineCollectorContext context = getTimelineEntityContext(next.getIdentifier());
                  return writer.write(context.getClusterId(), context.getUserId(),
                      context.getFlowId(), context.getFlowRunId(), context.getAppId(),
                      entities);
                  }
              

            If its ok then shall i work on it ?

            Naganarasimha Naganarasimha G R added a comment - Thanks for the feedback zjshen & sjlee0 , either pass in the context per call or have a map of app id to context. I would favor the latter approach because it'd be easier on the perspective of callers of putEntities(). I too agree it will be easier easier on the perspective of callers of putEntities() but if we favor for map of app id to context implicit assumption would be that {{putEntities(TimelineEntities ) }} will be for same appId(/will have have the same context) TimelineEntities as such do not have appID explicitly, so planning to modify TimelineCollector.getTimelineEntityContext() to TimelineCollector.getTimelineEntityContext(TimelineEntity.Identifier id) and subclasses of TimelineCollector can take care of mapping the Id to the Context (via AppId) if required. code of putEntities(TimelineEntities) would look something like Iterator<TimelineEntity> iterator = entities.getEntities().iterator(); TimelineEntity next = (iterator.hasNext())?iterator.next(): null ; if ( null !=next) { TimelineCollectorContext context = getTimelineEntityContext(next.getIdentifier()); return writer.write(context.getClusterId(), context.getUserId(), context.getFlowId(), context.getFlowRunId(), context.getAppId(), entities); } If its ok then shall i work on it ?
            zjshen Zhijie Shen added a comment -

            Thanks for sharing your idea, Naga! Thinking about RM side use case again, an RMTimelineCollectorManager is better than an RMTimelineCollector with a map of appId -> collector context info. RMTimelineCollectorManager is similar to what we have to NM, i.e., TimelineCollectorManager. However, it doesn't need IPC interface and web server, but maintain the collection of app-level collector an RM side. The benefit is that we can reuse most of the code of app-level collector as well as collector manager, uniforming the way we write the timeline entities.

            The approach is to make TimelineCollectorManager abstract and contain the common logic of managing collector lifecycles, and have NMTimelineCollectorManager extend TimelineCollectorManager by adding the IRP interface and start the web server, and have RMTimelineCollectorManager extend TimelineCollectorManager by sourcing RMContext for the context info and coupling with RMApp to add/remove the app-level collector. I created patch for this approach.

            zjshen Zhijie Shen added a comment - Thanks for sharing your idea, Naga! Thinking about RM side use case again, an RMTimelineCollectorManager is better than an RMTimelineCollector with a map of appId -> collector context info. RMTimelineCollectorManager is similar to what we have to NM, i.e., TimelineCollectorManager. However, it doesn't need IPC interface and web server, but maintain the collection of app-level collector an RM side. The benefit is that we can reuse most of the code of app-level collector as well as collector manager, uniforming the way we write the timeline entities. The approach is to make TimelineCollectorManager abstract and contain the common logic of managing collector lifecycles, and have NMTimelineCollectorManager extend TimelineCollectorManager by adding the IRP interface and start the web server, and have RMTimelineCollectorManager extend TimelineCollectorManager by sourcing RMContext for the context info and coupling with RMApp to add/remove the app-level collector. I created patch for this approach.
            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12725163/YARN-3390.1.patch
            against trunk revision b5a0b24.

            -1 patch. The patch command could not apply the patch.

            Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7327//console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725163/YARN-3390.1.patch against trunk revision b5a0b24. -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7327//console This message is automatically generated.
            zjshen Zhijie Shen added a comment -

            To help folks understand the patch, I introduce a bit more about the skeleton of the code change:

            • TimelineCollectorManager: the base class of managing collectors' life cycles
              • NMTimelineCollectorManager: extends TimelineCollectorManager, start web server to receive the incoming request and start the RPC channel for getting the context info publishing the service address.
              • RMTimelineCollectorManager: extends TimelineCollectorManager, and source rmContext for getting the context info.
            • ResourceManager: create RMTimelineCollectorManager and make it accessible inside the daemon via RMContext.
            • RMAppManager: add and remove app-level collector into/out of RMTimelineCollectorManager according to RMApp's lifecycle.

            After this, YARN-3044's scope is clearer, inside SystemMetricsPublisher, if we're using YTS v2, we compose the new timeline entity, pick the corresponding app-level collector from RMTimelineCollectorManager, and write through it. It's similar to what we're doing for writing DS and MR data to new timeline service.

            zjshen Zhijie Shen added a comment - To help folks understand the patch, I introduce a bit more about the skeleton of the code change: TimelineCollectorManager: the base class of managing collectors' life cycles NMTimelineCollectorManager: extends TimelineCollectorManager, start web server to receive the incoming request and start the RPC channel for getting the context info publishing the service address. RMTimelineCollectorManager: extends TimelineCollectorManager, and source rmContext for getting the context info. ResourceManager: create RMTimelineCollectorManager and make it accessible inside the daemon via RMContext. RMAppManager: add and remove app-level collector into/out of RMTimelineCollectorManager according to RMApp's lifecycle. After this, YARN-3044 's scope is clearer, inside SystemMetricsPublisher, if we're using YTS v2, we compose the new timeline entity, pick the corresponding app-level collector from RMTimelineCollectorManager, and write through it. It's similar to what we're doing for writing DS and MR data to new timeline service.

            Thanks for the patch zjshen, As earlier mentioned in Yarn-3044 if we go ahead with the notion that RM will be aware of RMCollector(manager) through RMcontext then in future if we want to segregate the dependencies b/w RM and Timeline service projects again need to do the changes in the core part (RMAppImpl and Resourcemanager), If the dependencies are fine then approach looks fine to me .
            Alternative way to avoid dependency is invoking RMTimelineCollectorManager's putIfAbsent() and remove(ApplicationId appId) when SMP needs to publish publishApplicationCreatedEvent and publishApplicationFinishedEvent. In the approach what i have taken in Yarn-3044, RMTimelineCollector(can be renamed to TimelineServiceV2Publisher which will be responsible for publishing v2 events) can invoke RMTimelineCollectorManager's putIfAbsent() and remove(ApplicationId appId).

            Naganarasimha Naganarasimha G R added a comment - Thanks for the patch zjshen , As earlier mentioned in Yarn-3044 if we go ahead with the notion that RM will be aware of RMCollector(manager) through RMcontext then in future if we want to segregate the dependencies b/w RM and Timeline service projects again need to do the changes in the core part (RMAppImpl and Resourcemanager), If the dependencies are fine then approach looks fine to me . Alternative way to avoid dependency is invoking RMTimelineCollectorManager's putIfAbsent() and remove(ApplicationId appId) when SMP needs to publish publishApplicationCreatedEvent and publishApplicationFinishedEvent. In the approach what i have taken in Yarn-3044, RMTimelineCollector(can be renamed to TimelineServiceV2Publisher which will be responsible for publishing v2 events) can invoke RMTimelineCollectorManager's putIfAbsent() and remove(ApplicationId appId) .
            gtcarrera9 Li Lu added a comment -

            I think for the problem of dependencies, our current project structure is too coarse-grained. Ideally we may want to have something like "timeline-client" for NM/RM/AMs to depend on. For now, the plan to reuse most of the collector's code LGTM. We can put app-collector mapping into the RM collector manager, just as storage layer connections. Chopping down dependencies in the future would be much easier than removing (mostly but not completely) duplicated collector code IMHO.

            gtcarrera9 Li Lu added a comment - I think for the problem of dependencies, our current project structure is too coarse-grained. Ideally we may want to have something like "timeline-client" for NM/RM/AMs to depend on. For now, the plan to reuse most of the collector's code LGTM. We can put app-collector mapping into the RM collector manager, just as storage layer connections. Chopping down dependencies in the future would be much easier than removing (mostly but not completely) duplicated collector code IMHO.
            sjlee0 Sangjin Lee added a comment -

            Thanks Naganarasimha and zjshen! I'll take a look at the patch a little more closely later. I just wanted to bring to your attention YARN-3437. There I did a fairly similar (but not identical) refactoring of TimelineCollectorManager as part of isolating the core piece of the timeline collector manager independent of the NM-TS interaction, etc. So hopefully we can arrive at a version that can satisfy both needs.

            Also, a small nit: I'm not too sure if I like the name "NMTimelineCollectorManager" (the "NM" part of it). It suggests bit too strongly that this is part of the NM. How about "NodeTimelineCollectorManager" or another bit more independent name? I can't think of many other alternatives, so any suggestion is welcome.

            sjlee0 Sangjin Lee added a comment - Thanks Naganarasimha and zjshen ! I'll take a look at the patch a little more closely later. I just wanted to bring to your attention YARN-3437 . There I did a fairly similar (but not identical) refactoring of TimelineCollectorManager as part of isolating the core piece of the timeline collector manager independent of the NM-TS interaction, etc. So hopefully we can arrive at a version that can satisfy both needs. Also, a small nit: I'm not too sure if I like the name "NMTimelineCollectorManager" (the "NM" part of it). It suggests bit too strongly that this is part of the NM. How about "NodeTimelineCollectorManager" or another bit more independent name? I can't think of many other alternatives, so any suggestion is welcome.
            zjshen Zhijie Shen added a comment -

            There I did a fairly similar (but not identical) refactoring of TimelineCollectorManager as part of isolating the core piece of the timeline collector manager independent of the NM-TS interaction, etc. So hopefully we can arrive at a version that can satisfy both needs.

            Interesting! It seems that we both find most of the code of TimelineCollectorManager in different places. I have quick glance and find two difference:

            1. I change appId type from string to ApplicationId, because I think the strong type, ApplicationId, is more commonly used in YARN, while I see we have trivial conversion from string to ApplicationId and then to string again.

            2. For putIfAbsent and remove, I don't use template method pattern, but let the subclass override the super class method and invoke it inside the override implementation, because I'm not sure if we will need pre process or post process, and if we only invoke the process when adding a new collector. If we're sure about template, I'm okay with the template pattern too.

            zjshen Zhijie Shen added a comment - There I did a fairly similar (but not identical) refactoring of TimelineCollectorManager as part of isolating the core piece of the timeline collector manager independent of the NM-TS interaction, etc. So hopefully we can arrive at a version that can satisfy both needs. Interesting! It seems that we both find most of the code of TimelineCollectorManager in different places. I have quick glance and find two difference: 1. I change appId type from string to ApplicationId, because I think the strong type, ApplicationId, is more commonly used in YARN, while I see we have trivial conversion from string to ApplicationId and then to string again. 2. For putIfAbsent and remove, I don't use template method pattern, but let the subclass override the super class method and invoke it inside the override implementation, because I'm not sure if we will need pre process or post process, and if we only invoke the process when adding a new collector. If we're sure about template, I'm okay with the template pattern too.
            sjlee0 Sangjin Lee added a comment -

            For putIfAbsent and remove, I don't use template method pattern, but let the subclass override the super class method and invoke it inside the override implementation, because I'm not sure if we will need pre process or post process, and if we only invoke the process when adding a new collector. If we're sure about template, I'm okay with the template pattern too.

            I'm fine with either approach. The main reason I thought of that is I wanted to be clear that the base implementation of putIfAbsent() and remove() is mandatory (i.e. not optional). Since we control all of it (base and subclasses), it might not be such a big deal either way.

            sjlee0 Sangjin Lee added a comment - For putIfAbsent and remove, I don't use template method pattern, but let the subclass override the super class method and invoke it inside the override implementation, because I'm not sure if we will need pre process or post process, and if we only invoke the process when adding a new collector. If we're sure about template, I'm okay with the template pattern too. I'm fine with either approach. The main reason I thought of that is I wanted to be clear that the base implementation of putIfAbsent() and remove() is mandatory (i.e. not optional). Since we control all of it (base and subclasses), it might not be such a big deal either way.
            sjlee0 Sangjin Lee added a comment -

            I took a pass at the patch, and it looks good for the most part. I would ask you to reconcile the TimelineCollectorManager changes with what I have over on YARN-3437. Again, I have a slight preference for the hook/template methods for the aforementioned reason, but it's not a strong preference one way or another.

            However, I'm not sure why there is a change for RMContainerAllocator.java. It doesn't look like an intended change?

            sjlee0 Sangjin Lee added a comment - I took a pass at the patch, and it looks good for the most part. I would ask you to reconcile the TimelineCollectorManager changes with what I have over on YARN-3437 . Again, I have a slight preference for the hook/template methods for the aforementioned reason, but it's not a strong preference one way or another. However, I'm not sure why there is a change for RMContainerAllocator.java. It doesn't look like an intended change?
            gtcarrera9 Li Lu added a comment -

            I skimmed through both patches. The two patches got significant overlap. Personally I'd incline to focus on YARN-3437 for now since it's critical to writer performance benchmark, which will further block the writer implementations. Writer implementation is on our critical path now to deliver an end-to-end preview of timeline v2. Therefore I'd prefer to give 3437 higher priority for now

            For the code, are there any special reasons why ConcurrentHashMap's fine-grained locking is not sufficient for collector manager? I think it may always be attractive to use finer-grained locking if we do not need strong consistency semantics, like snapshot or iterations.

            gtcarrera9 Li Lu added a comment - I skimmed through both patches. The two patches got significant overlap. Personally I'd incline to focus on YARN-3437 for now since it's critical to writer performance benchmark, which will further block the writer implementations. Writer implementation is on our critical path now to deliver an end-to-end preview of timeline v2. Therefore I'd prefer to give 3437 higher priority for now For the code, are there any special reasons why ConcurrentHashMap's fine-grained locking is not sufficient for collector manager? I think it may always be attractive to use finer-grained locking if we do not need strong consistency semantics, like snapshot or iterations.
            zjshen Zhijie Shen added a comment -

            The only conflict part between YARN-3437 and this Jira is TimelineCollectorManager base class. And we happened to resort to the similar refactoring method. I'm okay to commit YARN-3437 first. However, the comments about the base TimelineCollectorManager also apply. At least, I think we should use ApplicationId instead of String.

            zjshen Zhijie Shen added a comment - The only conflict part between YARN-3437 and this Jira is TimelineCollectorManager base class. And we happened to resort to the similar refactoring method. I'm okay to commit YARN-3437 first. However, the comments about the base TimelineCollectorManager also apply. At least, I think we should use ApplicationId instead of String.
            sjlee0 Sangjin Lee added a comment -

            Thanks Zhijie. I'll move forward with the existing patch for YARN-3437. You can still make the change of String => ApplicationId as part of this JIRA (as it involves more refactoring). How's that sound?

            sjlee0 Sangjin Lee added a comment - Thanks Zhijie. I'll move forward with the existing patch for YARN-3437 . You can still make the change of String => ApplicationId as part of this JIRA (as it involves more refactoring). How's that sound?
            zjshen Zhijie Shen added a comment -

            How's that sound?

            Sure, I'll take care of it.

            zjshen Zhijie Shen added a comment - How's that sound? Sure, I'll take care of it.
            gtcarrera9 Li Lu added a comment -

            Hi zjshen, thanks for the patch! Here are some of my comments. Most of them are quite minor:

            1. Changes in RMContainerAllocator.java appears to be irrelevant. Seems like this is changed by an IDE by mistake (on a refactoring)?
            2. In the following lines
              +    for (String tag : app.getApplicationTags()) {
              +      String value = null;
              +      if ((value = getFlowContext(TimelineUtils.FLOW_NAME_TAG_PREFIX, tag)) != null
              the first null assignment to value is marked as redundant
              +      if ((value = getFlowContext(TimelineUtils.FLOW_NAME_TAG_PREFIX, tag)) != null
              +          && !value.isEmpty()) {
              +          collector.getTimelineEntityContext().setFlowName(value);
              +      } else if ((value = getFlowContext(TimelineUtils.FLOW_VERSION_TAG_PREFIX, tag)) != null
              +          && !value.isEmpty()) {
              +        collector.getTimelineEntityContext().setFlowVersion(value);
              +      } else if ((value = getFlowContext(TimelineUtils.FLOW_RUN_ID_TAG_PREFIX, tag)) != null
              +          && !value.isEmpty()) {
              +        collector.getTimelineEntityContext().setFlowRunId(Long.valueOf(value));
              +      }
              

              Maybe we’d like to use a switch statement to deal with this? We may first split the tag into two parts, based on the first “:”, and then switch the first part of the returned array to set the second part of the array into flow name, version, and run id. Am I missing any fundamental obstacles for us to do this here? (String switch is available from Java 7)

            3. Rename MyNMTimelineCollectorManager in TestTimelineServiceClientIntegration with something indicating it's for testing?
            4. In the following lines:
              -  protected TimelineCollectorContext getTimelineEntityContext() {
              +  public TimelineCollectorContext getTimelineEntityContext() {
              

              We're exposing TimelineCollectorContext but we're not annotating the class. Even though we may treat unannotated classes as Audience.Private, maybe we'd like to mark it as unstable?

            5. In TimelineCollectorManager, I'm still having this question, although we may not want to address it in this JIRA: are there any special consistency requirements that prevent us from using ConcurrentHashMap?
            6. In TimelineCollectorWebService, why we're removing the utility function getCollector? I think we can reuse it when adding new web services.
            gtcarrera9 Li Lu added a comment - Hi zjshen , thanks for the patch! Here are some of my comments. Most of them are quite minor: Changes in RMContainerAllocator.java appears to be irrelevant. Seems like this is changed by an IDE by mistake (on a refactoring)? In the following lines + for ( String tag : app.getApplicationTags()) { + String value = null ; + if ((value = getFlowContext(TimelineUtils.FLOW_NAME_TAG_PREFIX, tag)) != null the first null assignment to value is marked as redundant + if ((value = getFlowContext(TimelineUtils.FLOW_NAME_TAG_PREFIX, tag)) != null + && !value.isEmpty()) { + collector.getTimelineEntityContext().setFlowName(value); + } else if ((value = getFlowContext(TimelineUtils.FLOW_VERSION_TAG_PREFIX, tag)) != null + && !value.isEmpty()) { + collector.getTimelineEntityContext().setFlowVersion(value); + } else if ((value = getFlowContext(TimelineUtils.FLOW_RUN_ID_TAG_PREFIX, tag)) != null + && !value.isEmpty()) { + collector.getTimelineEntityContext().setFlowRunId( Long .valueOf(value)); + } Maybe we’d like to use a switch statement to deal with this? We may first split the tag into two parts, based on the first “:”, and then switch the first part of the returned array to set the second part of the array into flow name, version, and run id. Am I missing any fundamental obstacles for us to do this here? (String switch is available from Java 7) Rename MyNMTimelineCollectorManager in TestTimelineServiceClientIntegration with something indicating it's for testing? In the following lines: - protected TimelineCollectorContext getTimelineEntityContext() { + public TimelineCollectorContext getTimelineEntityContext() { We're exposing TimelineCollectorContext but we're not annotating the class. Even though we may treat unannotated classes as Audience.Private, maybe we'd like to mark it as unstable? In TimelineCollectorManager, I'm still having this question, although we may not want to address it in this JIRA: are there any special consistency requirements that prevent us from using ConcurrentHashMap? In TimelineCollectorWebService, why we're removing the utility function getCollector ? I think we can reuse it when adding new web services.
            sjlee0 Sangjin Lee added a comment -

            In TimelineCollectorManager, I'm still having this question, although we may not want to address it in this JIRA: are there any special consistency requirements that prevent us from using ConcurrentHashMap?

            I can answer this as I added that code. In putIfAbsent(), it needs to start the collector as well if get() returns null. If we used ConcurrentHashMap and removed synchronization, multiple threads could start their own collectors unnecessarily. It is probably not a show stopper but less than desirable. Also, in real life the contention on TimelineCollectorManager is low enough that synchronization should be perfectly adequate. If we want to do this without synchronization, then we would want to use something like guava's LoadingCache.

            sjlee0 Sangjin Lee added a comment - In TimelineCollectorManager, I'm still having this question, although we may not want to address it in this JIRA: are there any special consistency requirements that prevent us from using ConcurrentHashMap? I can answer this as I added that code. In putIfAbsent(), it needs to start the collector as well if get() returns null. If we used ConcurrentHashMap and removed synchronization, multiple threads could start their own collectors unnecessarily. It is probably not a show stopper but less than desirable. Also, in real life the contention on TimelineCollectorManager is low enough that synchronization should be perfectly adequate. If we want to do this without synchronization, then we would want to use something like guava's LoadingCache.
            gtcarrera9 Li Lu added a comment -

            Hi sjlee0, thanks for the note! Your rational sounds quite reasonable so let keep the current design here. For now I'm OK with the coarse-grained synchronization.

            gtcarrera9 Li Lu added a comment - Hi sjlee0 , thanks for the note! Your rational sounds quite reasonable so let keep the current design here. For now I'm OK with the coarse-grained synchronization.
            zjshen Zhijie Shen added a comment -

            Thanks for the comments. I've addressed Sangjin and Li's comments except:

            maybe we'd like to mark it as unstable?

            It's not the API for the users, hence it's okay to leave it unannotated.

            In TimelineCollectorWebService, why we're removing the utility function getCollector?

            After the refactoring, we don't need to convert appId to string. It's not necessary to wrap a single statement in a method.

            In addition, I changed to use hook in TimelineCollectorManager, but postRemove is called before stopping the collector, because once the collector is stopped, the hook may not be able to do something with the stopped collector.

            Moreover, I moved RMApp.stopTimelineCollector into FinalTransition. Suppose the collector only collects application lifecycle events, it doesn't need to stay after the app is finished. We can adjust it later if we find the collector needs to stay after the app is finished.

            zjshen Zhijie Shen added a comment - Thanks for the comments. I've addressed Sangjin and Li's comments except: maybe we'd like to mark it as unstable? It's not the API for the users, hence it's okay to leave it unannotated. In TimelineCollectorWebService, why we're removing the utility function getCollector? After the refactoring, we don't need to convert appId to string. It's not necessary to wrap a single statement in a method. In addition, I changed to use hook in TimelineCollectorManager, but postRemove is called before stopping the collector, because once the collector is stopped, the hook may not be able to do something with the stopped collector. Moreover, I moved RMApp.stopTimelineCollector into FinalTransition. Suppose the collector only collects application lifecycle events, it doesn't need to stay after the app is finished. We can adjust it later if we find the collector needs to stay after the app is finished.
            gtcarrera9 Li Lu added a comment -

            Hi zjshen, the latest patch looks good to me. Just one note is that we need to be careful when overriding the postRemove method. Since it's in a synchronized block, having a long postRemove method may significantly blocks the critical section.

            I'm OK with all the rationales about the unfixed comments.

            gtcarrera9 Li Lu added a comment - Hi zjshen , the latest patch looks good to me. Just one note is that we need to be careful when overriding the postRemove method. Since it's in a synchronized block, having a long postRemove method may significantly blocks the critical section. I'm OK with all the rationales about the unfixed comments.
            sjlee0 Sangjin Lee added a comment - - edited

            zjshen, you might want to check out junping_du's comments and my response in the other JIRA here: https://issues.apache.org/jira/browse/MAPREDUCE-6335?focusedCommentId=14508378&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14508378

            I think those were small but useful changes. See this patch for the changes: https://issues.apache.org/jira/secure/attachment/12727521/YARN-3437.004.patch

            It would be good to preserve those changes. Thanks!

            sjlee0 Sangjin Lee added a comment - - edited zjshen , you might want to check out junping_du 's comments and my response in the other JIRA here: https://issues.apache.org/jira/browse/MAPREDUCE-6335?focusedCommentId=14508378&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14508378 I think those were small but useful changes. See this patch for the changes: https://issues.apache.org/jira/secure/attachment/12727521/YARN-3437.004.patch It would be good to preserve those changes. Thanks!
            zjshen Zhijie Shen added a comment -

            sjlee0, would you please take a look at the new patch? I move collector.start/init out of sync block, because I think we just need to sync "put if absent" logic. And I remove the sync block for removal operation, as it seem not to be necessary given the map is already synchronized.

            zjshen Zhijie Shen added a comment - sjlee0 , would you please take a look at the new patch? I move collector.start/init out of sync block, because I think we just need to sync "put if absent" logic. And I remove the sync block for removal operation, as it seem not to be necessary given the map is already synchronized.
            sjlee0 Sangjin Lee added a comment -

            LGTM. Unless there are objections in the next hour or so, I will commit the patch.

            sjlee0 Sangjin Lee added a comment - LGTM. Unless there are objections in the next hour or so, I will commit the patch.
            sjlee0 Sangjin Lee added a comment - - edited

            I'm very sorry for revisiting this one more time after saying LGTM. But it occurred to me while looking at your patch.

            You brought up a good point that collector.init() and collector.start() can be done without synchronization after put() is done (based on the collectorIsNew flag). I now realize that it is exactly the same semantics as ConcurrentHashMap, and we should be able to do this much more simply using ConcurrentHashMap. Sorry for the late change, but I propose this:

            private final ConcurrentMap<ApplicationId,TimelineCollector> collectors = new ConcurrentHashMap<>();
            ...
            public TimelineCollector putIfAbsent(ApplicationId appId, TimelineCollector collector) {
              TimelineCollector collectorInTable = collectors.putIfAbsent(appId, collector);
              if (collectorInTable == null) {
                LOG.info("the collector for " + appId + " was added");
                collectorInTable = collector;
                // initialize, start, and add it to the collection so it can be
                // cleaned up when the parent shuts down
                collector.init(getConfig());
                collector.start();
                postPut(appId, collectorInTable);
              } else {
                LOG.info("the collector for " + appId + " already exists!");
              }
              return collectorInTable;
            }
            

            Other code remains the same. Essentially the synchronization block in the current patch collapses into a single call on CHM.putIfAbsent().

            Previously, either junping_du or gtCarrera9 pointed out that there may be a race condition in terms of getting different collectors if CHM is used, but looking at this, I don't see that happening. Thoughts?

            sjlee0 Sangjin Lee added a comment - - edited I'm very sorry for revisiting this one more time after saying LGTM. But it occurred to me while looking at your patch. You brought up a good point that collector.init() and collector.start() can be done without synchronization after put() is done (based on the collectorIsNew flag). I now realize that it is exactly the same semantics as ConcurrentHashMap, and we should be able to do this much more simply using ConcurrentHashMap. Sorry for the late change, but I propose this: private final ConcurrentMap<ApplicationId,TimelineCollector> collectors = new ConcurrentHashMap<>(); ... public TimelineCollector putIfAbsent(ApplicationId appId, TimelineCollector collector) { TimelineCollector collectorInTable = collectors.putIfAbsent(appId, collector); if (collectorInTable == null ) { LOG.info( "the collector for " + appId + " was added" ); collectorInTable = collector; // initialize, start, and add it to the collection so it can be // cleaned up when the parent shuts down collector.init(getConfig()); collector.start(); postPut(appId, collectorInTable); } else { LOG.info( "the collector for " + appId + " already exists!" ); } return collectorInTable; } Other code remains the same. Essentially the synchronization block in the current patch collapses into a single call on CHM.putIfAbsent(). Previously, either junping_du or gtCarrera9 pointed out that there may be a race condition in terms of getting different collectors if CHM is used, but looking at this, I don't see that happening. Thoughts?
            gtcarrera9 Li Lu added a comment -

            Hi sjlee0, I think the racy situation happens like this: thread 1 and 2 try to update the collector for the same appId. Thread 1 arrived first, performs the putIfAbsent with the incoming collector and succeed. Thread 2 then come and grab the collector put by thread 1. If thread 1 is blocked for some reason, thread 2 will operate on the uninitialized collector, which may cause potential problems. I'm not sure how severe this problem is, but we can avoid it by calling putIfAbsent after we set up a collector. To avoid unnecessary collector setup, we can firstly check the concurrent hash map, and only perform this operation if a collector is not present for this application id. Not sure if that makes sense...

            gtcarrera9 Li Lu added a comment - Hi sjlee0 , I think the racy situation happens like this: thread 1 and 2 try to update the collector for the same appId. Thread 1 arrived first, performs the putIfAbsent with the incoming collector and succeed. Thread 2 then come and grab the collector put by thread 1. If thread 1 is blocked for some reason, thread 2 will operate on the uninitialized collector, which may cause potential problems. I'm not sure how severe this problem is, but we can avoid it by calling putIfAbsent after we set up a collector. To avoid unnecessary collector setup, we can firstly check the concurrent hash map, and only perform this operation if a collector is not present for this application id. Not sure if that makes sense...
            zjshen Zhijie Shen added a comment - - edited

            sjlee0, gtCarrera9 and junping_du, given the discussion about the race of collector map is not settled, can we separate it in another jira, move on with v3 patch, and unblock this jira for RM writing timeline data?

            zjshen Zhijie Shen added a comment - - edited sjlee0 , gtCarrera9 and junping_du , given the discussion about the race of collector map is not settled, can we separate it in another jira, move on with v3 patch, and unblock this jira for RM writing timeline data?
            sjlee0 Sangjin Lee added a comment -

            Thanks gtCarrera9 for reminding me of this. You mentioned this in another JIRA but I couldn't find it.

            Then v.3 does have the same issue in that collector.init() and collector.start() are done after the lock is released. Then I think we should go back to the previous version on this (collector.init() and collector.start() should be done inside the synchronization block so that no other thread can start using it before those operations are done).

            Could we make that one revision? Then I think we're good to go.

            sjlee0 Sangjin Lee added a comment - Thanks gtCarrera9 for reminding me of this. You mentioned this in another JIRA but I couldn't find it. Then v.3 does have the same issue in that collector.init() and collector.start() are done after the lock is released. Then I think we should go back to the previous version on this (collector.init() and collector.start() should be done inside the synchronization block so that no other thread can start using it before those operations are done). Could we make that one revision? Then I think we're good to go.
            gtcarrera9 Li Lu added a comment -

            Agree with sjlee0 that we may want to go back to the previous version for this (neither did I find that JIRA... ). Yes, let's address this in a separate JIRA.

            gtcarrera9 Li Lu added a comment - Agree with sjlee0 that we may want to go back to the previous version for this (neither did I find that JIRA... ). Yes, let's address this in a separate JIRA.
            zjshen Zhijie Shen added a comment -

            So are you okay to reuse v2 patch, which is the previous version?

            zjshen Zhijie Shen added a comment - So are you okay to reuse v2 patch, which is the previous version?
            sjlee0 Sangjin Lee added a comment -

            It's still v.3 (as it includes a couple of other changes over v.2) but putIfAbsent() should be basically back to v.2 (LOG.error should still be LOG.info). In other words,

              public TimelineCollector putIfAbsent(ApplicationId appId,
                  TimelineCollector collector) {
                TimelineCollector collectorInTable = null;
                synchronized (collectors) {
                  collectorInTable = collectors.get(appId);
                  if (collectorInTable == null) {
                    try {
                      // initialize, start, and add it to the collection so it can be
                      // cleaned up when the parent shuts down
                      collector.init(getConfig());
                      collector.start();
                      collectors.put(appId, collector);
                      LOG.info("the collector for " + appId + " was added");
                      collectorInTable = collector;
                      postPut(appId, collectorInTable);
                    } catch (Exception e) {
                      throw new YarnRuntimeException(e);
                    }
                  } else {
                    LOG.info("the collector for " + appId + " already exists!");
                  }
                }
                return collectorInTable;
              }
            

            How's that?

            sjlee0 Sangjin Lee added a comment - It's still v.3 (as it includes a couple of other changes over v.2) but putIfAbsent() should be basically back to v.2 (LOG.error should still be LOG.info). In other words, public TimelineCollector putIfAbsent(ApplicationId appId, TimelineCollector collector) { TimelineCollector collectorInTable = null ; synchronized (collectors) { collectorInTable = collectors.get(appId); if (collectorInTable == null ) { try { // initialize, start, and add it to the collection so it can be // cleaned up when the parent shuts down collector.init(getConfig()); collector.start(); collectors.put(appId, collector); LOG.info( "the collector for " + appId + " was added" ); collectorInTable = collector; postPut(appId, collectorInTable); } catch (Exception e) { throw new YarnRuntimeException(e); } } else { LOG.info( "the collector for " + appId + " already exists!" ); } } return collectorInTable; } How's that?
            zjshen Zhijie Shen added a comment -

            Update the patch accordingly.

            zjshen Zhijie Shen added a comment - Update the patch accordingly.
            sjlee0 Sangjin Lee added a comment -

            The latest patch LGTM. I'll commit it shortly. Thanks for your patience!

            sjlee0 Sangjin Lee added a comment - The latest patch LGTM. I'll commit it shortly. Thanks for your patience!
            gtcarrera9 Li Lu added a comment -

            Patch LGTM, thanks zjshen!

            gtcarrera9 Li Lu added a comment - Patch LGTM, thanks zjshen !
            sjlee0 Sangjin Lee added a comment -

            Committed. Thanks much zjshen and Naganarasimha for working on the patch, and gtCarrera9 for your review!

            sjlee0 Sangjin Lee added a comment - Committed. Thanks much zjshen and Naganarasimha for working on the patch, and gtCarrera9 for your review!
            zjshen Zhijie Shen added a comment -

            sjlee0, congrats on your first commit! Thanks for review, Naga and Li!

            zjshen Zhijie Shen added a comment - sjlee0 , congrats on your first commit! Thanks for review, Naga and Li!
            hudson Hudson added a comment -

            SUCCESS: Integrated in Hadoop-trunk-Commit #10074 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10074/)
            YARN-3390. Reuse TimelineCollectorManager for RM (Zhijie Shen via sjlee) (sjlee: rev 11e8905d8daf129afb6fe2e5a0eca11bcb1719c8)

            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/collector/TestTimelineCollectorManager.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContextImpl.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/timelineservice/RMTimelineCollectorManager.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/AppLevelTimelineCollector.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/NodeTimelineCollectorManager.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollector.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/timelineservice/RMTimelineCollector.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/timelineservice/TestTimelineServiceClientIntegration.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/collector/TestPerNodeTimelineCollectorsAuxService.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/PerNodeTimelineCollectorsAuxService.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollectorWebService.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollectorManager.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMActiveServiceContext.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContext.java
            • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/collector/TestNMTimelineCollectorManager.java
            hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-trunk-Commit #10074 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10074/ ) YARN-3390 . Reuse TimelineCollectorManager for RM (Zhijie Shen via sjlee) (sjlee: rev 11e8905d8daf129afb6fe2e5a0eca11bcb1719c8) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/collector/TestTimelineCollectorManager.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContextImpl.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/timelineservice/RMTimelineCollectorManager.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/AppLevelTimelineCollector.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/NodeTimelineCollectorManager.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollector.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/timelineservice/RMTimelineCollector.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/timelineservice/TestTimelineServiceClientIntegration.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/collector/TestPerNodeTimelineCollectorsAuxService.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/PerNodeTimelineCollectorsAuxService.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollectorWebService.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollectorManager.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMActiveServiceContext.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContext.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/collector/TestNMTimelineCollectorManager.java

            People

              zjshen Zhijie Shen
              zjshen Zhijie Shen
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: