Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: None
- Component/s: None
- Hadoop Flags: Reviewed
Description
This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2.
Attachments
- MAPREDUCE-6335.005.patch (17 kB, Sangjin Lee)
- MAPREDUCE-6335.006.patch (14 kB, Sangjin Lee)
- YARN-3437.001.patch (28 kB, Sangjin Lee)
- YARN-3437.002.patch (28 kB, Sangjin Lee)
- YARN-3437.003.patch (28 kB, Sangjin Lee)
- YARN-3437.004.patch (28 kB, Sangjin Lee)
Issue Links
- is depended upon by
  - MAPREDUCE-6337 add a mode to replay MR job history files to the timeline service (Resolved)
  - YARN-3390 Reuse TimelineCollectorManager for RM (Resolved)
- is part of
  - YARN-2928 YARN Timeline Service v.2: alpha 1 (Resolved)
- is related to
  - YARN-2556 Tool to measure the performance of the timeline server (Resolved)
- relates to
  - MAPREDUCE-6546 reconcile the two versions of the timeline service performance tests (Resolved)
Activity
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12709078/YARN-3437.001.patch
against trunk revision 6a6a59d.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7206//console
This message is automatically generated.
Thanks sjlee0 for delivering a patch here!
Having quickly gone through the patch, it looks like we are generating one app collector per map task. I think this is good for scalability testing of the backend storage, which can be the bottleneck in mainstream cases. In addition, do we want to address some extreme cases, e.g. a huge application with hundreds of thousands or even millions of tasks? If so, then maybe we also want to know a single app collector's bottleneck in accepting/forwarding messages from hundreds of thousands of maps. Also, in a real cluster, the mappings from cluster to apps and from app to tasks are both 1-to-N. Maybe making the number of app aggregators configurable (just like the number of map tasks, bytes per map, etc.) is something we can do as a next step?
BTW, it has some code duplicated from YARN-2556 (like TimelineServerPerformance.java). It looks like YARN-2556 is in pretty good shape and could go into trunk and branch-2 quickly. I would suggest keeping an eye on that JIRA's status and doing the necessary rebase work if that patch goes in, since we may want to merge it into the YARN-2928 branch soon.
Thanks junping_du for your comments!
I agree with most of your comments. This particular subtask generates a pretty simple model (just a straightforward conversion of YARN-2556) where each map task creates one app collector and simple entities. But we will add more test modes that will have different characteristics. For one, we will add a mode where it iterates over MR job history files and pump them into the timeline storage.
The number of mappers is a pretty useful way of controlling the parallelism of tests, and we can add more parameters to have finer-grained parallelism controls (one app collector manager having multiple app collectors, etc.).
BTW, it has some code duplicated from YARN-2556 (like TimelineServerPerformance.java).
That's because I copied it (as mentioned in the description). I had to fix a few issues with the original patch there to get it going. Some appear to be unrelated to porting it to the timeline service v.2. I'll add some comments there to help it get committed.
Forgot to add that I copied it to get the load testing going. Once YARN-2556 gets committed on trunk, I'd need to update this to handle both the old timeline service and the new timeline service.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12725758/YARN-3437.002.patch
against trunk revision 1b89a3e.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7354//console
This message is automatically generated.
Thanks zjshen! Could someone please commit this patch unless there are further comments?
Now that I have dug into timeline server performance (YARN-3448), I have a better understanding of what types of writes are costly. For example, a single entity will generate dozens of writes to the database. The number of primary keys, the number of related entities, and the write batch size (entities per put) greatly affect the time an entity put takes. While this is a good start, I think there should at least be a follow-up that addresses these issues to better measure the write performance.
jeagles, thanks for your comment.
Yes, I agree that the level of reporting in this patch is mostly basic and very high level. I definitely agree that we need to add more reporting in terms of the actual number of writes/puts done on the storage, etc. The first use of this tool is to compare the overall throughput/performance of the storage implementations and identify high-level issues; I suspect that for that purpose the coarse-grained reporting might be enough.
As a follow-up, I'll file a new JIRA for reporting more fine-grained write performance data.
Filing a new JIRA to track fine-grained performance data sounds good to me. Given that the patch here has code duplicated from YARN-2556, I would like to understand our plan for YARN-2556. jeagles, can you share your vision on this?
It looks like this JIRA blocks YARN-3390 (a refactoring JIRA), which blocks YARN-3044 (RM writing events to the v2 ATS service). I would like to have a clear path that lets all the patches go in as a pipeline, getting rid of any potential deadlock.
Maybe the first step is to get YARN-2556 committed, and then rebase the patch here? jeagles, sjlee0 and zjshen, what's your opinion on this?
I think we need to make progress on this as this is blocking other JIRAs and also it's tied to the schema evaluation.
My vote is to get this committed, and adjust this once YARN-2556 lands and we rebase. Thoughts?
I agree too, given that we don't have a clear plan for YARN-2556 so far. There should be no reason to block other ongoing efforts.
One (optional) suggestion: can we slightly adjust the name (or package path) of the file duplicated from YARN-2556 (TimelineServerPerformance.java)? We can have an additional patch to remove the duplicated file once YARN-2556 gets into trunk.
I assume this would make the YARN-2928 rebase back to trunk/branch-2 easier, as there would be fewer conflicts. Thoughts?
The downside is code duplication, but as you said we would not be impacted when/if YARN-2556 lands. If others are comfortable with that, I could quickly move it to a different package. Let me know soon, and I'll update the patch.
The downside is code duplication.
Agreed. However, I think this is the smallest price we can pay when developing on different branches in parallel.
If others are comfortable with that, I could quickly move it to a different package.
I am OK with this as long as we have an additional JIRA to track removing the duplicated code after YARN-2556 lands. We can also give some credit to the contributors of YARN-2556 in the commit/CHANGES messages, which is quite common.
Patch v.3. It's just a rename of the test class (with a couple of constants renamed accordingly) compared to patch v.2.
-1 overall

| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun. |

| Subsystem | Report/Notes |
|---|---|
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727346/YARN-3437.003.patch |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | trunk / 0ebe84d |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7459//console |
This message was automatically generated.
Thanks sjlee0 for updating the patch!
The latest patch looks good overall; some minor comments:
```java
public TimelineCollector putIfAbsent(ApplicationId appId,
    TimelineCollector collector) {
  String id = appId.toString();
  TimelineCollector collectorInTable;
  boolean collectorIsNew = false;
  synchronized (collectors) {
    collectorInTable = collectors.get(id);
    if (collectorInTable == null) {
      try {
        // initialize, start, and add it to the collection so it can be
        // cleaned up when the parent shuts down
        collector.init(getConfig());
        collector.start();
        collectors.put(id, collector);
        LOG.info("the collector for " + id + " was added");
        collectorInTable = collector;
        collectorIsNew = true;
      } catch (Exception e) {
        throw new YarnRuntimeException(e);
      }
    } else {
      String msg = "the collector for " + id + " already exists!";
      LOG.error(msg);
    }
  }

  if (collectorIsNew) {
    postPut(appId, collector);
  }

  return collectorInTable;
}
```
I understand this piece of code was moved from another place. However, I think it needs to be improved:
- From a performance perspective, we should move LOG.info() out of the synchronized block (may be move out of collector.start()?).
- We don't need to log at ERROR (replace with INFO?) if the collector exists; the general semantics of putIfAbsent allow concurrent threads to put the same object.
For remove(), similarly, we should move collector.stop() and LOG.info() out of the synchronized block.
Thanks for the review junping_du!
From a performance perspective, we should move LOG.info() out of the synchronized block (may be move out of collector.start()?).
I can move the LOG.info() call outside the synchronized block. That said, I don't think this would have a meaningful performance impact. Aside from the fact that logging calls are usually synchronized themselves, it is reasonable to expect that the contention for this lock (collectors) would be quite low. We're talking about contention when multiple AMs are competing to create collectors on the same node, and the chances that there is any contention on this lock would be very low.
Also, when you said "may be move out of collector.start()", did you mean moving the collector.start() call outside the synchronization block? If so, I'd be hesitant to do that. We just had a discussion on this in another JIRA (see https://issues.apache.org/jira/browse/YARN-3390?focusedCommentId=14508121&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14508121).
we don't need to LOG.ERROR (replace with INFO?)
That is a good suggestion. I'll update this (and remove()) to lower the logging level for this.
For remove(), similarly, we should move collector.stop() and LOG.info() out of the synchronized block.
This we can do safely. I'll update the patch.
Patch v.4.
- moved logging statements out of the synchronized blocks
- dropped logging level from ERROR to INFO
- reduced the synchronization scope in remove()
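As an illustration of that last change, here is a minimal sketch of what a reduced-scope remove() could look like (the collectors map, log messages, and the postRemove() hook are assumptions carried over from the putIfAbsent() snippet above; this is not the exact patch code):

```java
public boolean remove(ApplicationId appId) {
  String id = appId.toString();
  TimelineCollector collector;
  // hold the lock only long enough to update the map
  synchronized (collectors) {
    collector = collectors.remove(id);
  }
  if (collector == null) {
    LOG.info("the collector for " + id + " does not exist!");
    return false;
  }
  // stop and log outside the synchronized block so that a slow stop()
  // cannot block other threads adding or removing collectors
  postRemove(appId, collector);
  collector.stop();
  LOG.info("the collector for " + id + " was removed");
  return true;
}
```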
-1 overall

| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun. |

| Subsystem | Report/Notes |
|---|---|
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727521/YARN-3437.004.patch |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | trunk / a100be6 |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7466//console |
This message was automatically generated.
Also, when you said "may be move out of collector.start()", did you mean moving the collector.start() call outside the synchronization block? If so, I'd be hesitant to do that. We just had a discussion on this in another JIRA.
I see. We don't want another concurrent thread to get a non-started collector. We can improve this later (in some other JIRA), e.g. attach an additional monitor object to the specific collector (rather than synchronizing on all collectors) and do something like startIfNotStarted().
Another nit is to replace Collections.synchronizedMap with ConcurrentHashMap; the latter has better performance and better safety (it never throws ConcurrentModificationException) but does not keep the order of elements, which we don't need in our case. Given that we have YARN-3390 to continue the refactoring work on the same piece of code, we can continue the discussion there.
Latest patch LGTM. +1. Will go ahead and commit it within 24 hours if there are no objections/further comments from others.
Thanks Junping. I initially went with ConcurrentHashMap when I first created this, as that is my preference as well. But it was the need to prevent multiple threads from each starting their own collector (should that situation arise) that made ConcurrentHashMap not an option. Again, if we want both, we would need to look at a LoadingCache, but since this is really a low-contention situation, that would be overkill. The chances of this code running into lock contention should be low.
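To make the race being discussed concrete, here is a hypothetical sketch (not the patch code) of what a bare ConcurrentHashMap-based putIfAbsent would allow:

```java
// collectors is assumed to be a ConcurrentHashMap<String, TimelineCollector>
TimelineCollector existing = collectors.putIfAbsent(id, collector);
if (existing == null) {
  // this thread won the race and is responsible for starting the collector
  collector.init(getConfig());
  collector.start();
} else {
  // a losing thread sees `existing` as soon as it is in the map, possibly
  // BEFORE the winning thread has finished init()/start(); i.e., it can
  // receive a collector that is published but not yet started
}
```

Synchronizing the whole check-create-start sequence on the map, as the patch does, avoids handing out an unstarted collector at the cost of a (rarely contended) lock.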
Sorry for the late comments. This patch is half MR code and half YARN code; it's not good to commit it as one patch. I have a thought on managing the commits:
1. The YARN code is nearly a duplicate of YARN-3390. As YARN-3390 is almost ready, we can get that patch in first.
2. Move this JIRA to the MR project, retain only the MR code in the patch, and do a minor rebase on top of YARN-3390.
3. TimelineServicePerformanceTest is in a different package and has a different name, so hopefully it won't conflict with YARN-2556. Once YARN-2556 gets committed, we just need to refactor TimelineServicePerformanceTest to reuse the YARN-2556 code. BTW, can we put TimelineServicePerformanceTest into the same package as TimelineServicePerformance in YARN-2556, and rename it to TimelineServicePerformanceTestv2?
What do you think about this plan for the commits?
W.r.t. the patch, I'm a bit concerned that a write containing one event per entity is not typical enough to represent a real use case, and configuration and metrics are not covered at all. Would it be more realistic to write an entity with 10 events and 10 metrics, each with 100 points in the time series?
And one nit in the patch: entity.setEntityType("TEZ_DAG_ID"). How about not mentioning Tez in the MR code?
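For reference, a rough sketch of what such a heavier per-entity write could look like (assuming the timeline service v.2 TimelineEntity/TimelineEvent/TimelineMetric API; the names, type, and counts here are illustrative, not from the patch):

```java
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEvent;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineMetric;

// Hypothetical helper: builds one "heavier" entity as suggested above.
public static TimelineEntity createHeavyEntity(int index) {
  TimelineEntity entity = new TimelineEntity();
  entity.setId("test_entity_" + index);
  entity.setType("TEST_ENTITY"); // illustrative type instead of TEZ_DAG_ID
  entity.addConfig("test.config.key", "test config value");

  long now = System.currentTimeMillis();
  // 10 events per entity
  for (int i = 0; i < 10; i++) {
    TimelineEvent event = new TimelineEvent();
    event.setId("TEST_EVENT_" + i);
    event.setTimestamp(now - i);
    entity.addEvent(event);
  }
  // 10 metrics per entity, each a time series of 100 points
  for (int i = 0; i < 10; i++) {
    TimelineMetric metric = new TimelineMetric();
    metric.setId("test_metric_" + i);
    for (long ts = 0; ts < 100; ts++) {
      metric.addValue(now - ts, ts);
    }
    entity.addMetric(metric);
  }
  return entity;
}
```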
Thanks for your comments zjshen.
1. The YARN code is nearly a duplicate of YARN-3390. As YARN-3390 is almost ready, we can get that patch in first.
2. Move this JIRA to the MR project, retain only the MR code in the patch, and do a minor rebase on top of YARN-3390.
Let's do this. While working on YARN-3438, I realized that for the performance tests it is probably OK to use the TimelineCollectors directly and bypass the TimelineCollectorManager altogether. If we do that, then this could become a purely MR patch. I'll update this patch to remove the use of TimelineCollectorManager and move this JIRA to MAPREDUCE. How's that sound?
3. TimelineServicePerformanceTest is in a different package and has a different name, so hopefully it won't conflict with YARN-2556. Once YARN-2556 gets committed, we just need to refactor TimelineServicePerformanceTest to reuse the YARN-2556 code. BTW, can we put TimelineServicePerformanceTest into the same package as TimelineServicePerformance in YARN-2556, and rename it to TimelineServicePerformanceTestv2?
That's fine. I'll move it back to the same package.
W.r.t. the patch, I'm a bit concerned that a write containing one event per entity is not typical enough to represent a real use case, and configuration and metrics are not covered at all. Would it be more realistic to write an entity with 10 events and 10 metrics, each with 100 points in the time series?
And one nit in the patch: entity.setEntityType("TEZ_DAG_ID"). How about not mentioning Tez in the MR code?
Note that this is adding simple entity writes. The more realistic part of the test is coming in YARN-3438 (I'm nearly finished with that), and it will have multiple levels of entities as well as metrics and configuration. We would use that one for more realistic load, whereas we could keep this mode as a simpler test. Thoughts?
I'll change the name of the entity to be something else.
Well, it's not entirely true. It seems I still need to change TimelineCollector.getTimelineEntityContext() from protected to public. But creating another YARN JIRA just to make those several lines of changes seems too much. Thoughts, folks?
How's that sound?
It sounds good to me, too.
We would use that one for more realistic load, whereas we could keep this mode as a simpler test. Thoughts?
It's okay to make it a simpler case, but could we at least cover one config and one metric, so we can verify that the DB storing this info also works?
But creating another YARN JIRA just to make those several lines of changes seems too much
A couple of lines of YARN change in an MR patch is okay.
Oh, previously I said TimelineServicePerformanceTestv2, but actually I meant TimelineServicePerformanceV2. Just a minor suggestion; it's up to you to find a suitable class name.
-1 overall

| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun. |

| Subsystem | Report/Notes |
|---|---|
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727521/YARN-3437.004.patch |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | trunk / ef4e996 |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5438/console |
This message was automatically generated.
Patch v.5
- the test class moved back to org.apache.hadoop.mapred and renamed
- removed the TimelineCollectorManager changes and used TimelineCollector directly
- added a metric and a config and renamed the entity
-1 overall

| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun. |

| Subsystem | Report/Notes |
|---|---|
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727743/MAPREDUCE-6335.005.patch |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | trunk / ef4e996 |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5439/console |
This message was automatically generated.
Moved it under MAPREDUCE-6331, which is the MR umbrella for the v2 timeline service.
TimelineServicePerformanceV2 looks much better. I still have one question: System.runFinalization() is borrowed from YARN-2556, but is there any reason why we need to invoke object finalization explicitly here? Will it have some performance impact?
BTW, the YARN-side change is not necessary after YARN-3390.
Yeah it's not clear to me why System.runFinalization() is there. I'm going to remove it. I'll also remove the YARN changes. Will update the patch shortly.
-1 overall

| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| 0 | pre-patch | 5m 18s | Pre-patch trunk compilation is healthy. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 0s | The patch appears to include 2 new or modified test files. |
| +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| -1 | javac | 4m 40s | The patch appears to cause the build to fail. |

| Subsystem | Report/Notes |
|---|---|
| Patch URL | http://issues.apache.org/jira/secure/attachment/12728615/MAPREDUCE-6335.006.patch |
| Optional Tests | javac unit findbugs checkstyle |
| git revision | trunk / feb68cb |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5461/console |
This message was automatically generated.
+1 for the last patch. Will commit it if there are no more comments in the next few hours.
Committed the patch to branch YARN-2928. Thanks for the patch, Sangjin! Thanks for review, Junping!
SUCCESS: Integrated in Hadoop-trunk-Commit #10074 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10074/)
MAPREDUCE-6335. Created MR job based performance test driver for the (sjlee: rev 8c7b6dd2c7ed2528b125e86c0730ce4242ff4ad8)
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TimelineServicePerformanceV2.java
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/test/MapredTestDriver.java
Patch v.1 posted.
This is basically a modification of the YARN-2556 patch (and a clean-up of issues, etc.) to work against the timeline service v.2.
Since the new distributed timeline service collectors are tied to applications, I chose the approach of instantiating the base timeline collector within the mapper task, rather than going through the timeline client. Making it go through the timeline client has a number of challenges (see YARN-3378). But this should still be effective as a way to exercise the bulk of the write performance and scalability.
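In rough outline, the approach looks like the following sketch (class and method names such as AppLevelTimelineCollector and putEntities() reflect my reading of the v.2 collector API; this is an illustration, not a verbatim excerpt of the patch):

```java
// inside the mapper task: create a per-app collector and write to it
// directly, bypassing the RPC path through the timeline client
AppLevelTimelineCollector collector = new AppLevelTimelineCollector(appId);
collector.init(conf);
collector.start();
try {
  TimelineEntities entities = new TimelineEntities();
  entities.addEntity(entity);
  // the write goes straight to the backing storage via the collector
  collector.putEntities(entities, UserGroupInformation.getCurrentUser());
} finally {
  collector.stop();
}
```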
You can try this out by doing, for example:
You'll get output like:
It still uses pretty simple entities to write to the storage. I'll work on adding handling of job history files later in a different JIRA.
I would greatly appreciate your review. Thanks!