Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: None
- Component/s: None
- Hadoop Flags: Reviewed
Description
This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2.
Attachments
- MAPREDUCE-6335.005.patch (17 kB, Sangjin Lee)
- MAPREDUCE-6335.006.patch (14 kB, Sangjin Lee)
- YARN-3437.001.patch (28 kB, Sangjin Lee)
- YARN-3437.002.patch (28 kB, Sangjin Lee)
- YARN-3437.003.patch (28 kB, Sangjin Lee)
- YARN-3437.004.patch (28 kB, Sangjin Lee)
Issue Links
- is depended upon by
  - MAPREDUCE-6337 add a mode to replay MR job history files to the timeline service (Resolved)
  - YARN-3390 Reuse TimelineCollectorManager for RM (Resolved)
- is part of
  - YARN-2928 YARN Timeline Service v.2: alpha 1 (Resolved)
- is related to
  - YARN-2556 Tool to measure the performance of the timeline server (Resolved)
- relates to
  - MAPREDUCE-6546 reconcile the two versions of the timeline service performance tests (Resolved)
Activity
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12709078/YARN-3437.001.patch
against trunk revision 6a6a59d.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7206//console
This message is automatically generated.
Thanks sjlee0 for delivering a patch here!
Having quickly gone through the patch, it looks like we are generating one app collector per map task. I think this is good for scalability testing of the backend storage, which can be the bottleneck in mainstream cases. In addition, do we want to address some extreme cases, e.g. a huge application with hundreds of thousands or even millions of tasks? If so, then maybe we also want to know a single app collector's bottleneck in accepting/forwarding messages from hundreds of thousands of maps. Also, in a real cluster, the mappings from cluster to apps and from app to tasks are both 1-to-N. Maybe making the number of app aggregators configurable (just like the number of map tasks, bytes per map, etc.) is something we can do as a next step?
BTW, it has some code duplicated from YARN-2556 (like TimelineServerPerformance.java). It looks like YARN-2556 is in pretty good shape and could go into trunk and branch-2 quickly. I would suggest keeping an eye on that JIRA's status and doing the necessary rebase work if that patch goes in, since we may want to merge it into the YARN-2928 branch soon.
Thanks junping_du for your comments!
I agree with most of your comments. This particular subtask generates a pretty simple model (just a straightforward conversion of YARN-2556) where each map task creates one app collector and simple entities. But we will add more test modes that will have different characteristics. For one, we will add a mode where it iterates over MR job history files and pump them into the timeline storage.
The number of mappers is a pretty useful way of controlling the parallelism of tests, and we can add more parameters to have finer-grained parallelism controls (one app collector manager having multiple app collectors, etc.).
BTW, it has some code duplicated from YARN-2556 (like TimelineServerPerformance.java).
That's because I copied it (as mentioned in the description). I had to fix a few issues with the original patch there to get it going. Some appear to be unrelated to porting it to the timeline service v.2. I'll add some comments there to help it get committed.
Forgot to add that I copied it to get the load testing going. Once YARN-2556 gets committed on trunk, I'd need to update this to handle both the old timeline service and the new timeline service.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12725758/YARN-3437.002.patch
against trunk revision 1b89a3e.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7354//console
This message is automatically generated.
Thanks zjshen! Could someone please commit this patch unless there are further comments?
Now that I have dug into timeline server performance (YARN-3448), I have a better understanding of what types of writes are costly. For example, a single entity will generate dozens of writes to the database. The number of primary keys, the number of related entities, and the write batch size (entities per put) greatly affect the time an entity put takes. While this is a good start, I think there should at least be a follow-up that addresses these issues to better measure the write performance.
jeagles, thanks for your comment.
Yes, I agree that the level of reporting in this patch is mostly basic and very high level. I definitely agree that we need to add more reporting in terms of the actual number of writes/puts done on the storage, etc. The first use of this tool is to compare the overall throughput/performance of the storage implementations and identify high-level issues; I suspect that for that purpose the coarse-grained reporting might be enough.
As a follow-up, I'll file a new JIRA for reporting more fine-grained write performance data.
Filing a new JIRA to track fine-grained performance data sounds good to me. Given that the patch here has code duplicated from YARN-2556, I would like to understand our plan for YARN-2556. jeagles, can you share your vision on this?
It looks like this JIRA blocks YARN-3390 (a refactoring JIRA), which blocks YARN-3044 (RM writing events to the v2 ATS service). I would like to have a clear path that lets all the patches go in as a pipeline, getting rid of any potential deadlock.
Maybe the first step is to get YARN-2556 committed, and then rebase the patch here? jeagles, sjlee0 and zjshen, what's your opinion on this?
I think we need to make progress on this as this is blocking other JIRAs and also it's tied to the schema evaluation.
My vote is to get this committed, and adjust this once YARN-2556 lands and we rebase. Thoughts?
I agree too, given that we don't have a clear plan for YARN-2556 so far. There should be no reason to block other ongoing efforts.
One (optional) suggestion: can we slightly adjust the name (or package path) of the file duplicated from YARN-2556 (TimelineServerPerformance.java)? We can have an additional patch to remove the duplicated file once YARN-2556 gets into trunk.
I assume this would make the YARN-2928 rebase back to trunk/branch-2 easier, as there would be fewer conflicts. Thoughts?
The downside is code duplication, but as you said we would not be impacted when/if YARN-2556 lands. If others are comfortable with that, I could quickly move it to a different package. Let me know soon, and I'll update the patch.
The downside is code duplication.
Agreed. However, I think this is the smallest price we can pay when developing on different branches in parallel.
If others are comfortable with that, I could quickly move it to a different package.
I am OK with this as long as we have an additional JIRA to track removing the duplicated code after YARN-2556 lands. We can also give some credit to the contributors of YARN-2556 in the commit/CHANGES messages, which is quite common.
Patch v.3. It's just a rename of the test class (with a couple of constants renamed accordingly) compared to patch v.2.
-1 overall

| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun. |

| Subsystem | Report/Notes |
|---|---|
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727346/YARN-3437.003.patch |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | trunk / 0ebe84d |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7459//console |
This message was automatically generated.
Thanks sjlee0 for updating the patch!
The latest patch looks good overall; some minor comments:
```java
public TimelineCollector putIfAbsent(ApplicationId appId,
    TimelineCollector collector) {
  String id = appId.toString();
  TimelineCollector collectorInTable;
  boolean collectorIsNew = false;
  synchronized (collectors) {
    collectorInTable = collectors.get(id);
    if (collectorInTable == null) {
      try {
        // initialize, start, and add it to the collection so it can be
        // cleaned up when the parent shuts down
        collector.init(getConfig());
        collector.start();
        collectors.put(id, collector);
        LOG.info("the collector for " + id + " was added");
        collectorInTable = collector;
        collectorIsNew = true;
      } catch (Exception e) {
        throw new YarnRuntimeException(e);
      }
    } else {
      String msg = "the collector for " + id + " already exists!";
      LOG.error(msg);
    }
  }

  if (collectorIsNew) {
    postPut(appId, collector);
  }

  return collectorInTable;
}
```
I understand this piece of code was moved from another place. However, I think it needs to be improved:
- From a performance perspective, we should move LOG.info() out of the synchronized block (may be move out of collector.start()?).
- We don't need to log at ERROR (replace with INFO?) if the collector exists; the general semantics of putIfAbsent allow concurrent threads to put the same object.
For remove(), similarly, we should move collector.stop() and LOG.info() out of the synchronized block.
Thanks for the review junping_du!
From a performance perspective, we should move LOG.info() out of the synchronized block (may be move out of collector.start()?).
I can move the LOG.info() call outside the synchronized block. That said, I don't think this would have a meaningful performance impact. Aside from the fact that logging calls are usually synchronized themselves, it is reasonable to expect that the contention for this lock (collectors) would be quite low. We're talking about contention when multiple AMs are competing to create collectors on the same node, and the chances that there is any contention on this lock would be very low.
Also, when you said "may be move out of collector.start()", did you mean moving the collector.start() call outside the synchronization block? If so, I'd be hesitant to do that. We just had a discussion on this in another JIRA (see https://issues.apache.org/jira/browse/YARN-3390?focusedCommentId=14508121&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14508121).
we don't need to LOG.ERROR (replace with INFO?)
That is a good suggestion. I'll update this (and remove()) to lower the logging level for this.
For remove(), similarly, we should move collector.stop() and LOG.info() out of the synchronized block.
This we can do safely. I'll update the patch.
Patch v.4.
- moved logging statements out of the synchronized blocks
- dropped logging level from ERROR to INFO
- reduced the synchronization scope in remove()
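As an illustration of that last change, here is a minimal sketch of what a reduced-scope remove() could look like (the collectors map, log messages, and the postRemove() hook are assumptions carried over from the putIfAbsent() snippet above; this is not the exact patch code):

```java
public boolean remove(ApplicationId appId) {
  String id = appId.toString();
  TimelineCollector collector;
  // hold the lock only long enough to update the map
  synchronized (collectors) {
    collector = collectors.remove(id);
  }
  if (collector == null) {
    LOG.info("the collector for " + id + " does not exist!");
    return false;
  }
  // stop and log outside the synchronized block so that a slow stop()
  // cannot block other threads adding or removing collectors
  postRemove(appId, collector);
  collector.stop();
  LOG.info("the collector for " + id + " was removed");
  return true;
}
```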
-1 overall

| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun. |

| Subsystem | Report/Notes |
|---|---|
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727521/YARN-3437.004.patch |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | trunk / a100be6 |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7466//console |
This message was automatically generated.
Also, when you said "may be move out of collector.start()", did you mean moving the collector.start() call outside the synchronization block? If so, I'd be hesitant to do that. We just had a discussion on this in another JIRA.
I see. We don't want another concurrent thread to get a non-started collector. We can improve this later (in some other JIRA), e.g. attach an additional monitor object to the specific collector (rather than synchronizing on all collectors) and do something like startIfNotStarted().
Another nit is to replace Collections.synchronizedMap with ConcurrentHashMap; the latter has better performance and better safety (it never throws ConcurrentModificationException) but does not keep the order of elements, which we don't need in our case. Given that we have YARN-3390 to continue the refactoring work on the same piece of code, we can continue the discussion there.
Latest patch LGTM. +1. Will go ahead and commit it within 24 hours if there are no objections/further comments from others.
Thanks Junping. I initially went with ConcurrentHashMap when I first created this, as that is my preference as well. But it was the need to prevent multiple threads from each starting their own collector (should that situation arise) that made ConcurrentHashMap not an option. Again, if we want both, we would need to look at a LoadingCache, but since this is really a low-contention situation, that would be overkill. The chances of this code running into lock contention should be low.
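To make the race being discussed concrete, here is a hypothetical sketch (not the patch code) of what a bare ConcurrentHashMap-based putIfAbsent would allow:

```java
// collectors is assumed to be a ConcurrentHashMap<String, TimelineCollector>
TimelineCollector existing = collectors.putIfAbsent(id, collector);
if (existing == null) {
  // this thread won the race and is responsible for starting the collector
  collector.init(getConfig());
  collector.start();
} else {
  // a losing thread sees `existing` as soon as it is in the map, possibly
  // BEFORE the winning thread has finished init()/start(); i.e., it can
  // receive a collector that is published but not yet started
}
```

Synchronizing the whole check-create-start sequence on the map, as the patch does, avoids handing out an unstarted collector at the cost of a (rarely contended) lock.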
Sorry for the late comments. This patch is half MR code and half YARN code; it's not good to commit it as one patch. I have a thought on managing the commits:
1. The YARN code is nearly a duplicate of YARN-3390. As YARN-3390 is almost ready, we can get that patch in first.
2. Move this JIRA to the MR project, retain only the MR code in the patch, and do a minor rebase on top of YARN-3390.
3. TimelineServicePerformanceTest is in a different package and has a different name, so hopefully it won't conflict with YARN-2556. Once YARN-2556 gets committed, we just need to refactor TimelineServicePerformanceTest to reuse the YARN-2556 code. BTW, can we put TimelineServicePerformanceTest into the same package as TimelineServicePerformance in YARN-2556, and rename it to TimelineServicePerformanceTestv2?
What do you think about this plan for the commits?
W.r.t. the patch, I'm a bit concerned that a write containing one event per entity is not typical enough to represent a real use case, and configuration and metrics are not covered at all. Would it be more realistic to write an entity with 10 events and 10 metrics, each with 100 points in the time series?
And one nit in the patch: entity.setEntityType("TEZ_DAG_ID"). How about not mentioning Tez in the MR code?
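For reference, a rough sketch of what such a heavier per-entity write could look like (assuming the timeline service v.2 TimelineEntity/TimelineEvent/TimelineMetric API; the names, type, and counts here are illustrative, not from the patch):

```java
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEvent;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineMetric;

// Hypothetical helper: builds one "heavier" entity as suggested above.
public static TimelineEntity createHeavyEntity(int index) {
  TimelineEntity entity = new TimelineEntity();
  entity.setId("test_entity_" + index);
  entity.setType("TEST_ENTITY"); // illustrative type instead of TEZ_DAG_ID
  entity.addConfig("test.config.key", "test config value");

  long now = System.currentTimeMillis();
  // 10 events per entity
  for (int i = 0; i < 10; i++) {
    TimelineEvent event = new TimelineEvent();
    event.setId("TEST_EVENT_" + i);
    event.setTimestamp(now - i);
    entity.addEvent(event);
  }
  // 10 metrics per entity, each a time series of 100 points
  for (int i = 0; i < 10; i++) {
    TimelineMetric metric = new TimelineMetric();
    metric.setId("test_metric_" + i);
    for (long ts = 0; ts < 100; ts++) {
      metric.addValue(now - ts, ts);
    }
    entity.addMetric(metric);
  }
  return entity;
}
```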
Thanks for your comments zjshen.
1. The YARN code is nearly a duplicate of YARN-3390. As YARN-3390 is almost ready, we can get that patch in first.
2. Move this JIRA to the MR project, retain only the MR code in the patch, and do a minor rebase on top of YARN-3390.
Let's do this. While working on YARN-3438, I realized that for the performance tests it is probably OK to use the TimelineCollectors directly and bypass the TimelineCollectorManager altogether. If we do that, then this could become a purely MR patch. I'll update this patch to remove the use of TimelineCollectorManager and move this JIRA to MAPREDUCE. How's that sound?
3. TimelineServicePerformanceTest is in a different package and has a different name, so hopefully it won't conflict with YARN-2556. Once YARN-2556 gets committed, we just need to refactor TimelineServicePerformanceTest to reuse the YARN-2556 code. BTW, can we put TimelineServicePerformanceTest into the same package as TimelineServicePerformance in YARN-2556, and rename it to TimelineServicePerformanceTestv2?
That's fine. I'll move it back to the same package.
W.r.t. the patch, I'm a bit concerned that a write containing one event per entity is not typical enough to represent a real use case, and configuration and metrics are not covered at all. Would it be more realistic to write an entity with 10 events and 10 metrics, each with 100 points in the time series?
And one nit in the patch: entity.setEntityType("TEZ_DAG_ID"). How about not mentioning Tez in the MR code?
Note that this is adding simple entity writes. The more realistic part of the test is coming in YARN-3438 (I'm nearly finished with that), and it will have multiple levels of entities as well as metrics and configuration. We would use that one for more realistic load, whereas we could keep this mode as a simpler test. Thoughts?
I'll change the name of the entity to be something else.
Well, it's not entirely true. It seems I still need to change TimelineCollector.getTimelineEntityContext() from protected to public. But creating another YARN JIRA just to make those several lines of changes seems too much. Thoughts, folks?
How's that sound?
It sounds good to me, too.
We would use that one for more realistic load, whereas we could keep this mode as a simpler test. Thoughts?
It's okay to make it a simpler case, but could we at least cover one config and one metric, so we can verify that the DB storing this info also works?
But creating another YARN JIRA just to make those several lines of changes seems too much
A couple of lines of YARN change in an MR patch is okay.
Oh, previously I said TimelineServicePerformanceTestv2, but actually I meant TimelineServicePerformanceV2. Just a minor suggestion; it's up to you to find a suitable class name.
-1 overall

| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun. |

| Subsystem | Report/Notes |
|---|---|
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727521/YARN-3437.004.patch |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | trunk / ef4e996 |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5438/console |
This message was automatically generated.
Patch v.5
- the test class moved back to org.apache.hadoop.mapred and renamed
- removed the TimelineCollectorManager changes and used TimelineCollector directly
- added a metric and a config and renamed the entity
-1 overall

| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| -1 | patch | 0m 0s | The patch command could not apply the patch during dryrun. |

| Subsystem | Report/Notes |
|---|---|
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727743/MAPREDUCE-6335.005.patch |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | trunk / ef4e996 |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5439/console |
This message was automatically generated.
Moved it under MAPREDUCE-6331, which is the MR umbrella for the v2 timeline service.
TimelineServicePerformanceV2 looks much better. I still have one question: System.runFinalization() is borrowed from YARN-2556, but is there any reason why we need to invoke object finalization explicitly here? Will it have some performance impact?
BTW, the YARN-side change is not necessary after YARN-3390.
Yeah it's not clear to me why System.runFinalization() is there. I'm going to remove it. I'll also remove the YARN changes. Will update the patch shortly.
-1 overall

| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| 0 | pre-patch | 5m 18s | Pre-patch trunk compilation is healthy. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 0s | The patch appears to include 2 new or modified test files. |
| +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| -1 | javac | 4m 40s | The patch appears to cause the build to fail. |

| Subsystem | Report/Notes |
|---|---|
| Patch URL | http://issues.apache.org/jira/secure/attachment/12728615/MAPREDUCE-6335.006.patch |
| Optional Tests | javac unit findbugs checkstyle |
| git revision | trunk / feb68cb |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5461/console |
This message was automatically generated.
+1 for the last patch. Will commit it if there are no more comments in the next few hours.
Committed the patch to branch YARN-2928. Thanks for the patch, Sangjin! Thanks for review, Junping!
SUCCESS: Integrated in Hadoop-trunk-Commit #10074 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10074/)
MAPREDUCE-6335. Created MR job based performance test driver for the (sjlee: rev 8c7b6dd2c7ed2528b125e86c0730ce4242ff4ad8)
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TimelineServicePerformanceV2.java
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/test/MapredTestDriver.java
Patch v.1 posted.
This is basically a modification of the YARN-2556 patch (and a clean-up of issues, etc.) to work against the timeline service v.2.
Since the new distributed timeline service collectors are tied to applications, I chose the approach of instantiating the base timeline collector within the mapper task, rather than going through the timeline client. Making it go through the timeline client has a number of challenges (see YARN-3378). But this should still be effective as a way to exercise the bulk of the write performance and scalability.
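In rough outline, the approach looks like the following sketch (class and method names such as AppLevelTimelineCollector and putEntities() reflect my reading of the v.2 collector API; this is an illustration, not a verbatim excerpt of the patch):

```java
// inside the mapper task: create a per-app collector and write to it
// directly, bypassing the RPC path through the timeline client
AppLevelTimelineCollector collector = new AppLevelTimelineCollector(appId);
collector.init(conf);
collector.start();
try {
  TimelineEntities entities = new TimelineEntities();
  entities.addEntity(entity);
  // the write goes straight to the backing storage via the collector
  collector.putEntities(entities, UserGroupInformation.getCurrentUser());
} finally {
  collector.stop();
}
```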
You can try this out by doing, for example:
You'll get output like:
It still uses pretty simple entities to write to the storage. I'll work on adding handling of job history files later in a different JIRA.
I would greatly appreciate your review. Thanks!