Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.9.0, 3.0.0-alpha1
    • Hadoop Flags: Reviewed

    Description

      This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2.

      Attachments

        1. MAPREDUCE-6335.005.patch
          17 kB
          Sangjin Lee
        2. MAPREDUCE-6335.006.patch
          14 kB
          Sangjin Lee
        3. YARN-3437.001.patch
          28 kB
          Sangjin Lee
        4. YARN-3437.002.patch
          28 kB
          Sangjin Lee
        5. YARN-3437.003.patch
          28 kB
          Sangjin Lee
        6. YARN-3437.004.patch
          28 kB
          Sangjin Lee

        Issue Links

          Activity

            sjlee0 Sangjin Lee added a comment -

            Patch v.1 posted.

            This is basically a modification of the YARN-2556 patch (and clean-up of issues etc.) to work against the timeline service v.2.

            Since the new distributed timeline service collectors are tied to applications, I chose the approach of instantiating the base timeline collector within the mapper task, rather than going through the timeline client. Making it go through the timeline client has a number of challenges (see YARN-3378). But this should still be effective as a way to exercise the bulk of the write performance and scalability.

            You can try this out by running, for example:

            hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT-tests.jar timelineperformance -m 10 -t 1000
            

            You'll get output like the following:

            TRANSACTION RATE (per mapper): 5027.652086 ops/s
            IO RATE (per mapper): 5027.652086 KB/s
            TRANSACTION RATE (total): 50276.520865 ops/s
            IO RATE (total): 50276.520865 KB/s
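            As a quick sanity check on the sample numbers (assuming, since the source does not state it explicitly, that the totals are simply the per-mapper rate multiplied by the -m mapper count):

```java
// Hypothetical check of the sample output above: with -m 10 mappers, the
// total rate is the per-mapper rate times 10; the IO rate matching the
// transaction rate implies roughly 1 KB written per operation.
class RateCheck {
    static double totalRate(double perMapperRate, int mappers) {
        return perMapperRate * mappers;
    }

    public static void main(String[] args) {
        // per-mapper rate taken from the sample run
        System.out.printf("total: %.6f ops/s%n", totalRate(5027.652086, 10));
    }
}
```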
            

            It is still using pretty simple entities to write to the storage. I'll work on handling job history files later in a different JIRA.

            I would greatly appreciate your review. Thanks!

            sjlee0 Sangjin Lee added a comment -

            Added a few folks for review.

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12709078/YARN-3437.001.patch
            against trunk revision 6a6a59d.

            -1 patch. The patch command could not apply the patch.

            Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7206//console

            This message is automatically generated.

            junping_du Junping Du added a comment -

            Thanks sjlee0 for delivering a patch here!
            I just quickly went through the patch; it looks like we are generating one app collector per map task. I think this is good for a scalability test of the backend storage, which can be the bottleneck in mainstream cases. In addition, do we want to address some extreme cases, e.g. a huge application with hundreds of thousands or even millions of tasks? If so, then maybe we want to know a single app collector's bottleneck as well for accepting/forwarding messages from hundreds of thousands of maps. Also, in a real cluster, the mappings from cluster to app and from app to tasks are all 1-N mappings. Maybe making the number of app aggregators configurable (just like the map task number, bytes per map, etc.) is something we can do as a next step?
            BTW, it has some duplicated code with YARN-2556 (like TimelineServerPerformance.java). It looks like YARN-2556 is in pretty good shape and could go into trunk and branch-2 quickly. I would recommend keeping an eye on that JIRA's status and doing the necessary rebase work if that patch goes in, as we may want to merge it into the YARN-2928 branch soon.

            sjlee0 Sangjin Lee added a comment (edited) -

            Thanks junping_du for your comments!

            I agree with most of your comments. This particular subtask generates a pretty simple model (just a straightforward conversion of YARN-2556) where each map task creates one app collector and simple entities. But we will add more test modes with different characteristics. For one, we will add a mode that iterates over MR job history files and pumps them into the timeline storage.

            The number of mappers is a pretty useful way of controlling the parallelism of tests, and we can add more parameters to have finer-grained parallelism controls (one app collector manager having multiple app collectors, etc.).

            BTW, it has some duplicated code with YARN-2556 (like TimelineServerPerformance.java).

            That's because I copied it (as mentioned in the description). I had to fix a few issues with the original patch there to get it going. Some appear to be unrelated to porting it to the timeline service v.2. I'll add some comments there to help it get committed.

            sjlee0 Sangjin Lee added a comment -

            Forgot to add that I copied it to get the load testing going. Once YARN-2556 gets committed on trunk, I'd need to update this to handle both the old timeline service and the new timeline service.

            sjlee0 Sangjin Lee added a comment -

            Rebased the patch with the latest from the YARN-2928 branch.

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12725758/YARN-3437.002.patch
            against trunk revision 1b89a3e.

            -1 patch. The patch command could not apply the patch.

            Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7354//console

            This message is automatically generated.

            zjshen Zhijie Shen added a comment -

            Per my comment on YARN-3390, please feel free to move forward.

            sjlee0 Sangjin Lee added a comment -

            Thanks zjshen! Could someone please commit this patch unless there are further comments?

            zjshen Zhijie Shen added a comment -

            Will take a look today.


            jeagles Jonathan Turner Eagles added a comment -

            Now that I have dug into timeline server performance (YARN-3448), I have a better understanding of what types of writes are costly. For example, a single entity will generate dozens of writes to the database. The number of primary keys, the number of related entities, and the write batch size (entities per put) greatly affect the time an entity put takes. While this is a good start, I think there should at least be a follow-up that addresses these issues to better measure the write performance.

            sjlee0 Sangjin Lee added a comment -

            jeagles, thanks for your comment.

            Yes, I agree the level of reporting in this patch is mostly basic and very high level. I definitely agree that we need to add more reporting in terms of the actual number of writes/puts done on the storage, etc. The first use of this tool is to compare the overall throughput/performance of the storage implementations and identify high level issues. I suspect for that the coarse-grained reporting might be enough.

            As a follow-up, I'll file a new JIRA that reports more fine-grained write performance data.

            sjlee0 Sangjin Lee added a comment -

            Added YARN-3512.

            junping_du Junping Du added a comment -

            Filing a new JIRA to track fine-grained performance data sounds good to me. Given that the patch here has duplicated code with YARN-2556, I would like to understand our plan for YARN-2556. jeagles, can you share your vision on this?
            It looks like this JIRA blocks YARN-3390 (a refactoring JIRA), which blocks YARN-3044 (RM writing events to the v2 ATS service). I would like a clear path that lets all the patches go in as a pipeline while getting rid of any potential deadlock.
            Maybe the first step is to get YARN-2556 committed and rebase the patch here? jeagles, sjlee0 and zjshen, what's your opinion on this?

            sjlee0 Sangjin Lee added a comment -

            I think we need to make progress on this, as it is blocking other JIRAs and is also tied to the schema evaluation.

            My vote is to get this committed, and adjust this once YARN-2556 lands and we rebase. Thoughts?

            junping_du Junping Du added a comment -

            I agree too, since we don't have a clear plan for YARN-2556 so far. There is no reason to block other ongoing efforts.
            A suggestion (optional only): can we slightly adjust the name (or package path) of the file duplicated with YARN-2556 (TimelineServerPerformance.java)? We can have an additional patch that removes the duplicated file when YARN-2556 gets into trunk.
            I assume this would make the YARN-2928 rebase back to trunk/branch-2 easier, with fewer conflicts. Thoughts?

            sjlee0 Sangjin Lee added a comment -

            The downside is code duplication, but as you said we would not be impacted when/if YARN-2556 lands. If others are comfortable with that, I could quickly move it to a different package. Let me know soon, and I'll update the patch.

            junping_du Junping Du added a comment -

            The downside is code duplication.

            Agree. However, I think this is the smallest price we can pay when developing on different branches in parallel.

            If others are comfortable with that, I could quickly move it to a different package.

            I am OK with this as long as we have an additional JIRA to track removing the duplicated code after YARN-2556 lands. We can also give credit to the contributors of YARN-2556 in the commit/CHANGES messages, which is very common.

            zjshen and jeagles, any ideas on this?

            sjlee0 Sangjin Lee added a comment -

            Patch v.3. It's just a rename of the test class (with a couple of constants renamed accordingly) compared to patch v.2.

            hadoopqa Hadoop QA added a comment -



            -1 overall

            Vote | Subsystem | Runtime | Comment
            -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun.

            Subsystem | Report/Notes
            Patch URL | http://issues.apache.org/jira/secure/attachment/12727346/YARN-3437.003.patch
            Optional Tests | javac unit findbugs checkstyle javadoc
            git revision | trunk / 0ebe84d
            Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7459//console

            This message was automatically generated.

            sjlee0 Sangjin Lee added a comment -

            Could you kindly take a look at the latest patch? Thanks!

            sjlee0 Sangjin Lee added a comment - Could you kindly take a look at the latest patch? Thanks!
            junping_du Junping Du added a comment -

            Thanks sjlee0 for updating the patch!
            The latest patch looks good overall; some minor comments:

            +  public TimelineCollector putIfAbsent(ApplicationId appId,
            +      TimelineCollector collector) {
            +    String id = appId.toString();
            +    TimelineCollector collectorInTable;
            +    boolean collectorIsNew = false;
            +    synchronized (collectors) {
            +      collectorInTable = collectors.get(id);
            +      if (collectorInTable == null) {
            +        try {
            +          // initialize, start, and add it to the collection so it can be
            +          // cleaned up when the parent shuts down
            +          collector.init(getConfig());
            +          collector.start();
            +          collectors.put(id, collector);
            +          LOG.info("the collector for " + id + " was added");
            +          collectorInTable = collector;
            +          collectorIsNew = true;
            +        } catch (Exception e) {
            +          throw new YarnRuntimeException(e);
            +        }
            +      } else {
            +        String msg = "the collector for " + id + " already exists!";
            +        LOG.error(msg);
            +      }
            +    }
            +
            +    if (collectorIsNew) {
            +      postPut(appId, collector);
            +    }
            +
            +    return collectorInTable;
            +  }
            

            I understand this code was moved from elsewhere. However, I think it needs to be improved:

            • From a performance perspective, we should move LOG.info() out of the synchronized block (and maybe move collector.start() out as well?).
            • We don't need to LOG.error() (replace with INFO?) if the collector already exists; the general semantics of putIfAbsent allow putting the same object from concurrent threads.

            For remove(), we should similarly move collector.stop() and LOG.info() out of the synchronized block.
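            A minimal sketch of the suggested restructuring, using stand-in classes rather than the actual Hadoop TimelineCollector/ApplicationId types: the logging moves outside the synchronized block, while start() stays inside so no thread can observe an unstarted collector.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for illustration only; not the Hadoop classes.
class CollectorManager {
    static class Collector {
        boolean started = false;
        void start() { started = true; }
    }

    private final Map<String, Collector> collectors = new HashMap<>();

    public Collector putIfAbsent(String id, Collector collector) {
        Collector inTable;
        boolean isNew = false;
        synchronized (collectors) {
            inTable = collectors.get(id);
            if (inTable == null) {
                // start() remains inside the lock so concurrent callers
                // never see an unstarted collector
                collector.start();
                collectors.put(id, collector);
                inTable = collector;
                isNew = true;
            }
        }
        // logging happens after the lock is released
        if (isNew) {
            System.out.println("the collector for " + id + " was added");
        } else {
            System.out.println("the collector for " + id + " already exists");
        }
        return inTable;
    }
}
```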

            sjlee0 Sangjin Lee added a comment (edited) -

            Thanks for the review junping_du!

            For performance perspective, we should move LOG.info() out of synchronized block (may be move out of collector.start()?).

            I can move the LOG.info() call outside the synchronized block. That said, I don't think this would have a meaningful performance impact. Aside from the fact that logging calls are usually synchronized themselves, it is reasonable to expect that the contention for this lock (collectors) would be quite low. We're talking about contention when multiple AMs are competing to create collectors on the same node, and the chances that there is any contention on this lock would be very low.

            Also, when you said "may be move out of collector.start()", did you mean moving the collector.start() call outside the synchronization block? If so, I'd be hesitant to do that. We just had a discussion on this in another JIRA (see https://issues.apache.org/jira/browse/YARN-3390?focusedCommentId=14508121&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14508121).

            we don't need to LOG.ERROR (replace with INFO?)

            That is a good suggestion. I'll update this (and remove()) to lower the logging level for this.

            For remove(), similar that we should move collector.stop() and LOG.info() out of synchronized block.

            This we can do safely. I'll update the patch.

            sjlee0 Sangjin Lee added a comment -

            Patch v.4.

            • moved logging statements out of the synchronized blocks
            • dropped logging level from ERROR to INFO
            • reduced the synchronization scope in remove()
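            The reduced scope in remove() might look roughly like this (a sketch with assumed stand-in classes, not the actual patch): only the map mutation holds the lock, and the potentially slow collector.stop() plus the logging run afterwards.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-ins only; not the Hadoop TimelineCollectorManager.
class CollectorRemover {
    static class Collector {
        boolean stopped = false;
        void stop() { stopped = true; }
    }

    private final Map<String, Collector> collectors = new HashMap<>();

    void add(String id, Collector c) {
        synchronized (collectors) {
            collectors.put(id, c);
        }
    }

    public boolean remove(String id) {
        Collector removed;
        synchronized (collectors) {
            // only the map mutation is done under the lock
            removed = collectors.remove(id);
        }
        if (removed == null) {
            System.out.println("the collector for " + id + " does not exist");
            return false;
        }
        // stop() and logging run outside the synchronized block
        removed.stop();
        System.out.println("the collector for " + id + " was removed");
        return true;
    }
}
```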
            hadoopqa Hadoop QA added a comment -



            -1 overall

            Vote | Subsystem | Runtime | Comment
            -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun.

            Subsystem | Report/Notes
            Patch URL | http://issues.apache.org/jira/secure/attachment/12727521/YARN-3437.004.patch
            Optional Tests | javac unit findbugs checkstyle javadoc
            git revision | trunk / a100be6
            Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7466//console

            This message was automatically generated.

            junping_du Junping Du added a comment -

            Also, when you said "may be move out of collector.start()", did you mean moving the collector.start() call outside the synchronization block? If so, I'd be hesitant to do that. We just had a discussion on this in another JIRA.

            I see. We don't want another concurrent thread to get a non-started collector. We can improve this later (in some other JIRA), e.g. attach an additional monitor object to each specific collector (rather than synchronizing on all collectors) and do something like startIfNotStarted().

            Another nit is to replace Collections.synchronizedMap with ConcurrentHashMap; the latter has better performance and better safety (it never throws ConcurrentModificationException) but does not keep the order of elements, which we don't need in our case anyway. Given that we have YARN-3390 to continue the refactoring work on the same piece of code, we can continue the discussion there.

            Latest patch LGTM. +1. Will go ahead and commit it within 24h if there are no objections or further comments from others.

            sjlee0 Sangjin Lee added a comment -

            Thanks Junping. I initially went with ConcurrentHashMap when I first created this, as that is my preference as well. But it was the requirement of preventing multiple threads from starting their collectors (should that situation arise) that made ConcurrentHashMap not an option. If we want both, we would need to look at a LoadingCache, but since this is really a low-contention situation, that would be overkill. The chances of this code running into lock contention should be low.
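            A sketch of the race being described (stand-in classes and hypothetical names, not the actual code): with ConcurrentHashMap.putIfAbsent, each caller has to construct and start its own collector up front, so a thread that loses the race has started a collector that is immediately discarded, which is what ruled ConcurrentHashMap out here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Stand-in classes for illustration; not the Hadoop TimelineCollector.
class ChmRace {
    static class Collector {
        boolean started = false;
        Collector startAndReturn() { started = true; return this; }
    }

    static final ConcurrentMap<String, Collector> collectors =
        new ConcurrentHashMap<>();

    static Collector register(String id) {
        // every caller must start its collector before the put...
        Collector mine = new Collector().startAndReturn();
        Collector prev = collectors.putIfAbsent(id, mine);
        // ...so if another thread won the race, "mine" was started in vain
        return prev != null ? prev : mine;
    }
}
```

With the synchronizedMap approach in the patch, by contrast, only the winning thread ever calls start().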

            zjshen Zhijie Shen added a comment -

            Sorry for the late comments. This patch is half MR code and half YARN code; it's not good to commit it as one patch. Here is one thought on managing the commits:

            1. The YARN code is nearly a duplicate of YARN-3390. As YARN-3390 is almost ready, we can get that patch in first.

            2. Move this JIRA to the MR project, retain only the MR code in the patch, and do some minor rebasing according to YARN-3390.

            3. TimelineServicePerformanceTest is in a different package and has a different name, so hopefully it won't conflict with YARN-2556. Once YARN-2556 gets committed, we just need to refactor TimelineServicePerformanceTest to reuse the YARN-2556 code. BTW, can we put TimelineServicePerformanceTest into the same package as TimelineServicePerformance in YARN-2556, and rename it to TimelineServicePerformanceTestv2?

            What do you think about the plan for the commits?

            W.r.t. the patch, I'm a bit concerned that a write containing one event per entity is not very representative of a real use case, and configurations and metrics are not even covered. Would it be more realistic to write an entity with 10 events and 10 metrics, each with 100 points in the time series?

            And one nit in the patch: entity.setEntityType("TEZ_DAG_ID"). How about not mentioning TEZ in the MR code?

            sjlee0 Sangjin Lee added a comment -

            Thanks for your comments zjshen.

            1. The YARN code is nearly duplicate with YARN-3390. As YARN-3390 is almost ready, we can get that patch in first.
            2. Move this jira to MR project, only retain the MR code in the patch and do some minor rebase according to YARN-3390.

            Let's do this. While working on YARN-3438, I realized that for the performance tests it is probably OK to use the TimelineCollectors directly and bypass the TimelineCollectorManager altogether. If we do that, this could become a purely MR patch. I'll update the patch to remove the use of TimelineCollectorManager and move this JIRA to MAPREDUCE. How does that sound?

            3. TimelineServicePerformanceTest is in different package and has the different name. Hopefully it won't conflict with YARN-2556. So once YARN-2556 gets committed, we just need to refactor TimelineServicePerformanceTest to reuse YARN-2556 code. BTW, can we put TimelineServicePerformanceTest into the same package of TimelineServicePerformance in YARN-2556, and rename it to TimelineServicePerformanceTestv2?

            That's fine. I'll move it back to the same package.

            W.R.T to the patch, I'm a bit concerned that the write which contains one event per entity is not so typical to represent real use case. And configuration and metrics are even not covered. Is it more realistic to write an entity with 10 events and 10 metrics, which have 100 points in the time series?
            And one nit in the patch: entity.setEntityType("TEZ_DAG_ID");. How about not mentioning TEZ in the MR code?

            Note that this is adding simple entity writes. The more realistic part of the test is coming in YARN-3438 (I'm nearly finished with that), and it will have multiple levels of entities as well as metrics and configuration. We would use that one for more realistic load whereas we could keep this mode as a simpler test. Thoughts?

            I'll change the name of the entity to be something else.

            sjlee0 Sangjin Lee added a comment -

            Well, that's not entirely true, it seems: I still need to change TimelineCollector.getTimelineEntityContext() from protected to public. But creating another YARN JIRA just to make those few lines of changes seems like too much. Thoughts, folks?

            zjshen Zhijie Shen added a comment -

            How's that sound?

            It's also good to me.

            We would use that one for more realistic load whereas we could keep this mode as a simpler test. Thoughts?

            It's okay to keep it a simpler case, but could we at least cover one config and one metric, so that we can verify that the DB path storing this info also works?

            But creating another YARN JIRA just to make those several lines of changes seems too much

            A couple of lines of YARN change in an MR patch is okay.

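            The minimal payload requested above (one config and one single-point metric per entity) can be sketched as follows. These are plain stand-in collections, not the real org.apache.hadoop.yarn.api.records.timelineservice classes, and the names "FOO_APP", "foo_metric", and "config.key" are made up for illustration.

            ```java
            import java.util.HashMap;
            import java.util.Map;
            import java.util.TreeMap;

            // Stand-in sketch of the "simple entity" shape: one config entry and one
            // single-point metric per entity, so the storage paths for both are
            // exercised even in the simple-entities mode of the load driver.
            public class SimpleEntitySketch {
              static class Entity {
                String id;
                String type;
                final Map<String, String> configs = new HashMap<>();
                final Map<String, Map<Long, Number>> metrics = new HashMap<>();
              }

              static Entity buildSimpleEntity(int seq, long now) {
                Entity e = new Entity();
                e.id = "test_entity_" + seq;     // neutral id, no TEZ reference
                e.type = "FOO_APP";              // hypothetical entity type
                e.configs.put("config.key", "config.value");  // one config entry
                Map<Long, Number> points = new TreeMap<>();   // timestamp -> value
                points.put(now, 100L);                        // one metric point
                e.metrics.put("foo_metric", points);
                return e;
              }

              public static void main(String[] args) {
                Entity e = buildSimpleEntity(0, System.currentTimeMillis());
                System.out.println(e.id + " configs=" + e.configs.size()
                    + " metrics=" + e.metrics.size());
              }
            }
            ```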
            zjshen Zhijie Shen added a comment -

            Oh, previously I said TimelineServicePerformanceTestv2, but actually I meant TimelineServicePerformanceV2. Just a minor suggestion, and it's up to you to find the suitable class name.

            hadoopqa Hadoop QA added a comment -



            -1 overall

            Vote | Subsystem | Runtime | Comment
            -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun.

            Subsystem | Report/Notes
            Patch URL | http://issues.apache.org/jira/secure/attachment/12727521/YARN-3437.004.patch
            Optional Tests | javac unit findbugs checkstyle javadoc
            git revision | trunk / ef4e996
            Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5438/console

            This message was automatically generated.

            sjlee0 Sangjin Lee added a comment -

            Patch v.5

            • the test class moved back to org.apache.hadoop.mapred and renamed
            • removed the TimelineCollectorManager changes and used TimelineCollector directly
            • added a metric and a config and renamed the entity
            hadoopqa Hadoop QA added a comment -



            -1 overall

            Vote | Subsystem | Runtime | Comment
            -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun.

            Subsystem | Report/Notes
            Patch URL | http://issues.apache.org/jira/secure/attachment/12727743/MAPREDUCE-6335.005.patch
            Optional Tests | javac unit findbugs checkstyle javadoc
            git revision | trunk / ef4e996
            Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5439/console

            This message was automatically generated.

            junping_du Junping Du added a comment -

            Moved it under MAPREDUCE-6331, which is the MR umbrella for the v2 timeline service.

            zjshen Zhijie Shen added a comment -

            TimelineServicePerformanceV2 looks much better. I still have one question: System.runFinalization(); is borrowed from YARN-2556, but is there any reason we need to invoke object finalization explicitly here? Will it have a performance impact?

            BTW, YARN side change is not necessary after YARN-3390.

            sjlee0 Sangjin Lee added a comment -

            Yeah, it's not clear to me why System.runFinalization() is there; I'm going to remove it. I'll also remove the YARN changes and update the patch shortly.

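            For context on the question above: although the call was removed here, explicitly forcing GC and finalization before a timed run is a common microbenchmark stabilization trick, since it drains pending collector work so GC pauses are less likely to land inside the measured window. A self-contained sketch of the idea (this is generic benchmark boilerplate, not the actual TimelineServicePerformance code):

            ```java
            // Generic benchmark warm-up sketch: stabilize() drains pending GC and
            // finalization work before the timed section starts.
            public class BenchPrepSketch {
              static void stabilize() {
                System.gc();               // request collection of dead objects
                System.runFinalization();  // drain any pending finalizers
                System.gc();               // collect objects freed by those finalizers
              }

              public static void main(String[] args) {
                stabilize();
                long start = System.nanoTime();
                long sum = 0;
                for (int i = 0; i < 1_000_000; i++) {
                  sum += i;                // stand-in for the measured write loop
                }
                long elapsedNs = System.nanoTime() - start;
                System.out.println("sum=" + sum
                    + " nonNegativeElapsed=" + (elapsedNs >= 0));
              }
            }
            ```

            Note that object finalization is deprecated in recent JDKs (JEP 421), which is one more reason dropping the call here was reasonable.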
            sjlee0 Sangjin Lee added a comment -

            Patch v.6

            hadoopqa Hadoop QA added a comment -



            -1 overall

            Vote | Subsystem | Runtime | Comment
            0 | pre-patch | 5m 18s | Pre-patch trunk compilation is healthy.
            +1 | @author | 0m 0s | The patch does not contain any @author tags.
            +1 | tests included | 0m 0s | The patch appears to include 2 new or modified test files.
            +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace.
            -1 | javac | 4m 40s | The patch appears to cause the build to fail.

            Subsystem | Report/Notes
            Patch URL | http://issues.apache.org/jira/secure/attachment/12728615/MAPREDUCE-6335.006.patch
            Optional Tests | javac unit findbugs checkstyle
            git revision | trunk / feb68cb
            Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5461/console

            This message was automatically generated.

            sjlee0 Sangjin Lee added a comment -

            Could you kindly take a look at the latest patch? Thanks.

            zjshen Zhijie Shen added a comment -

            +1 for the latest patch. Will commit it if there are no more comments in the next few hours.

            zjshen Zhijie Shen added a comment -

            Committed the patch to branch YARN-2928. Thanks for the patch, Sangjin! Thanks for review, Junping!

            hudson Hudson added a comment -

            SUCCESS: Integrated in Hadoop-trunk-Commit #10074 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10074/)
            MAPREDUCE-6335. Created MR job based performance test driver for the (sjlee: rev 8c7b6dd2c7ed2528b125e86c0730ce4242ff4ad8)

            • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TimelineServicePerformanceV2.java
            • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/test/MapredTestDriver.java

            People

              sjlee0 Sangjin Lee
              sjlee0 Sangjin Lee
              Votes: 0
              Watchers: 11
