Description
Regarding AMRMClientImpl:
Scenario 1:
Given a ContainerRequest x with Resource y: when addContainerRequest is called z times with x, allocate is called, and at least one of the z allocated containers is started, then if another addContainerRequest call is made and subsequently an allocate call is sent to the RM, (z+1) containers will be allocated, where only 1 container is expected.
Scenario 2:
Same as Scenario 1, except that no containers are started between the allocate calls.
Analyzing debug logs of the AMRMClientImpl, I have found that (z+1) containers are indeed requested in both scenarios, but that the correct behavior is observed only in the second scenario.
Looking at the implementation, I have found that this (z+1) request is caused by the structure of the remoteRequestsTable: because the table is a Map<Resource, ResourceRequestInfo>, a ResourceRequestInfo holds no information about whether its request has already been sent to the RM.
There are workarounds for this, such as releasing the excess containers received.
The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM.
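To illustrate the idea, here is a simplified sketch with hypothetical names (not the actual patch code): once an ask has been sent to the RM successfully, the pending count is reset so that later addContainerRequest calls accumulate into a fresh request instead of re-inflating one the RM already knows about.

// Simplified sketch of the patch's idea; names are illustrative,
// not the actual AMRMClientImpl/ResourceRequestInfo code.
class ResourceRequestInfoSketch {
  int pendingContainers; // accumulated by addContainerRequest, not yet sent
  int sentContainers;    // already communicated to the RM

  void addContainerRequest() {
    pendingContainers++;
  }

  // Called after an allocate call has successfully reached the RM:
  // start a fresh ResourceRequest so new adds are not conflated with
  // requests the RM is already tracking.
  void onAskSentToRM() {
    sentContainers += pendingContainers;
    pendingContainers = 0;
  }
}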
The patch includes a test covering Scenario 1.
Attachments
- YARN-1902.patch (17 kB, Sietse T. Au)
- YARN-1902.v2.patch (17 kB, Sietse T. Au)
- YARN-1902.v3.patch (17 kB, Sietse T. Au)
Issue Links
- causes: YARN-9877 Intermittent TIME_OUT of LogAggregationReport (Resolved)
- is related to: SLIDER-829 when containers are allocated, explicitly cancel the request (Resolved)
- relates to: YARN-110 AM releases too many containers due to the protocol (Open)
Activity
This bug can actually also cause application crashes, if the application handles "ContainerAllocated" events by stockpiling them and then scheduling tasks to these containers as they arrive. This usually leads to timeouts of the involved tokens, and to very interesting guesswork about why program logic is attempting to launch containers that have been assigned obsolete tokens.
I also wonder how this mixes with the recent addition of "opportunistic allocation".
Hadoop 3 would have been a great opportunity to close this bug.
Sorry for jumping in late, but I'd like to keep moving this forward. There are a significant number of wasted container allocations as apps adjust their resource requests, and that adds unnecessary load to the RM and reduces cluster efficiency.
IMHO, without a protocol overhaul the fix has to come from the RM side if we want to minimize the excess containers. The inherent problem is that the AM is adjusting its resource request without knowledge of what the RM has already allocated since the last heartbeat. Therefore, if the RM sees an update to the AM ask during a heartbeat and that same heartbeat's response has containers already allocated, it needs to adjust the AM's ask by the containers that are in that response. For example:
- AM asks for 5 containers
- On a subsequent heartbeat the RM responds with 1 container
- On the next heartbeat the AM adjusts its ask to 4 containers, but the RM has already allocated the remaining 4 containers from the original ask.
- The RM needs to interpret the new ask not as 4 more containers but as 0 containers since 4 of them are already satisfied in the current heartbeat's response.
If apps were well behaved, I think we could get most of the benefit by simply adjusting the new total (ANY) ask update by the number of containers in the same heartbeat's response. It's true that an AM could get containers in that response that don't match its request, but a well-behaved app should realize that any container received counts against the total (ANY) resource request. Therefore if the app throws away the container but still needs another it must update at least the total container request to ask for a replacement.
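A minimal sketch of that adjustment (illustrative names, not actual AppSchedulingInfo code):

// Hypothetical sketch of the proposed RM-side adjustment.
class AskAdjustmentSketch {
  // newTotalAsk: total containers the AM is now asking for.
  // inFlightThisBeat: containers already allocated and being returned
  // in this same heartbeat's response.
  static int adjustedAsk(int newTotalAsk, int inFlightThisBeat) {
    // The AM built newTotalAsk without knowing about inFlightThisBeat,
    // so those containers would otherwise be requested twice.
    return Math.max(0, newTotalAsk - inFlightThisBeat);
  }

  public static void main(String[] args) {
    // AM asked for 5, got 1, re-asks for 4 while the remaining 4 are
    // already in this heartbeat's response: effective ask is 0.
    System.out.println(adjustedAsk(4, 4)); // prints 0
  }
}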
I have been experimenting with the idea of changing AppSchedulingInfo to maintain a total request table and a fulfilled allocation table, and then calculate the difference of the two tables as the real outstanding request table used for scheduling. All was fine until I realized that this cannot handle one use case: an AMRMClient, right before sending the allocation heartbeat, removes all container requests and adds new container requests at the same priority and location (possibly with a different resource capability). AppSchedulingInfo does not know about this, and may not treat the newly added container requests as outstanding requests.
I agree that currently I do not see a clean solution without affecting backward compatibility.
All solutions will still be workarounds unless the protocol is revised.
Another workaround would be to keep track of the requests by counting the number of requested containers and not sending new container requests to RM until the previous batch has been satisfied.
Consider the following scenario, in order:
1. addContainerRequest is called n times; at each call the expectedContainers counter is incremented and the container request is added to a list of currentContainerRequests.
2. allocate is called; a boolean waitingForResponse is set to true when ask.size > 0, which indicates that container requests have been made.
3. addContainerRequest is called m times; since waitingForResponse is true, the requests are added to a list of queuedContainerRequests and the asks are added to asksQueue instead of asks.
4. allocate is called; n - 1 containers are returned, so expectedContainers is decremented by n - 1.
5. allocate is called again; 1 container is returned and expectedContainers reaches 0, so:
- waitingForResponse is set to false,
- removeContainerRequest is called for each currentContainerRequest,
- currentContainerRequests = queuedContainerRequests,
- asks = asksQueue,
- expectedContainers = queuedContainerRequests.size.
6. allocate is called and the requests from step (3) are submitted.
Here, the satisfied container requests will be correctly removed from the table without user intervention, and this seems to apply to common use cases. Excess containers will now only happen when a containerRequest is removed after an allocate; but since there is no guarantee that it would be removed in time at the RM anyway, this does not seem very significant.
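A minimal sketch of this bookkeeping (illustrative, not AMRMClientImpl code; the allocate call and its response are collapsed into one method for brevity):

import java.util.ArrayList;
import java.util.List;

class BatchingClientSketch<T> {
  private final List<T> currentContainerRequests = new ArrayList<>();
  private final List<T> queuedContainerRequests = new ArrayList<>();
  private int expectedContainers;
  private boolean waitingForResponse;

  void addContainerRequest(T request) {
    if (waitingForResponse) {
      queuedContainerRequests.add(request); // step 3: defer the ask
    } else {
      currentContainerRequests.add(request); // step 1
      expectedContainers++;
    }
  }

  // Models an allocate call whose response returned `allocated` containers.
  void allocate(int allocated) {
    if (!waitingForResponse && expectedContainers > 0) {
      waitingForResponse = true; // step 2: a batch is now in flight
    }
    expectedContainers -= allocated; // steps 4 and 5
    if (waitingForResponse && expectedContainers == 0) {
      // step 5: the batch is satisfied, promote the queued requests
      currentContainerRequests.clear();
      currentContainerRequests.addAll(queuedContainerRequests);
      queuedContainerRequests.clear();
      expectedContainers = currentContainerRequests.size();
      waitingForResponse = false; // step 6: the next allocate submits them
    }
  }
}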
One problem here is that the expectedContainers will be invalid when you do the following:
blacklist all the possible nodes, add container request, allocate, remove blacklist, add container request, allocate.
This would make the client wait forever for a response of the first request as it will never be satisfied.
I'm not sure what else can be done by users apart from extending the AMRMClientImpl to fit their use case.
An alternate approach that we tried in Apache Tez is to wrap a TaskScheduler around the AMRMClient that would take requests from the application and do the matching internally. Since it knows the matching, it can also automatically remove the matched requests. (This still does not remove the race condition, but it is cleaner w.r.t. the user as an API.) The TaskScheduler was written to be independent of Tez code so that we could contribute it to YARN as a library; however, we did not find time to do so. That code has since evolved quite a bit, but the original, well-tested code could still be extracted from the Tez 0.1 branch and contributed to YARN if someone is interested in doing that work.
Thanks bikassaha and vinodkv for the education and background info. Really helpful. I can now appreciate that there is not a straightforward solution to this problem.
Originally I was coming from a pure user-experience point of view, where I was thinking that if I ever want to use removeContainerRequest, it should only be because I need to cancel previous add requests. Yes, I may still get the number of containers from the previous requests, but that is understandable. However, I would never have thought that I still need to call removeContainerRequest to remove requests for matched containers in order to keep the internal bookkeeping of AMRMClient correct. Why should a user worry about these things?
After reading the comments, I start to think that even if we were able to figure out which ResourceRequest to deduct from and automatically deduct it at the Client, it still won't solve race condition 1 (i.e., allocated containers are sitting in RM).
So rather than changing the client, can we not do something at the RM side? For example, in AppSchedulingInfo:
1. Maintain a table for total request only. The updateResourceRequests() call will update this table to reflect the total requests from the client (matching the client side remoteRequestsTable).
2. Maintain a table for requests that have been satisfied. Every time a successful allocation is made for this application, this table is updated.
3. The difference between table 1 and table 2 will be the outstanding resource requests. This table is updated at every updateResourceRequests() and every successful allocation. Of course proper synchronization needs to be taken care of.
4. The scheduling will be made based on table 3 (i.e., the outstanding request table); a rough sketch follows below.
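A minimal sketch of the three tables (hypothetical names; the real keying would be by priority, resource name, and capability, simplified here to one String key):

import java.util.HashMap;
import java.util.Map;

class OutstandingRequestsSketch {
  private final Map<String, Integer> total = new HashMap<>();     // table 1
  private final Map<String, Integer> satisfied = new HashMap<>(); // table 2

  // Called from updateResourceRequests(): the client sends totals.
  void updateTotal(String key, int numContainers) {
    total.put(key, numContainers);
  }

  // Called on every successful allocation for this application.
  void onAllocated(String key) {
    satisfied.merge(key, 1, Integer::sum);
  }

  // Table 3: what the scheduler should actually try to satisfy.
  int outstanding(String key) {
    return Math.max(0,
        total.getOrDefault(key, 0) - satisfied.getOrDefault(key, 0));
  }
}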
Do you think this is something worth considering?
Thanks a lot in advance.
Meng
Yes. And then the RM may give a container on H1 which is not useful for the app. If we again auto-decrement and release the container then we end up with 2 outstanding requests and the job will hang because it needs 3 containers.
Wangda Tan mentioned offline that we could at least deduct the count against the overall number (the ANY request) for a given priority.
Further thought tells me this is not desired in some cases as well.
Take the following example.
User originally wants: 1 container on H1, 1 container on H2, and 2 containers on R1 (rack). The request table becomes
| H1 | 1 |
| H2 | 1 |
| R1 | 2 |
| * | 4 |
Now assuming RM returns a container on R2 (rack), auto-decrementing the request table will make it
| H1 | 1 |
| H2 | 1 |
| R1 | 2 |
| * | 3 |
But the user may actually want something like the following, depending on the user's scheduling preferences.
| H1 | 0 |
| H2 | 1 |
| R1 | 2 |
| * | 3 |
This was discussed multiple times before.
Two kinds of races can happen: a resource-table deduction happens while
- allocated containers are already sitting in the RM (tracked at YARN-110)
- allocated containers are already sitting in the client library
Seems like this JIRA is talking about both (1) and (2).
The dist-shell example above sounds like it could be because of (1).
Re (2), as Bikas says, the notion of forcing apps to deduct requests after a successful allocation (using AMRMClient.removeContainerRequest()) was introduced because the library has no way of knowing which ResourceRequest to deduct from. leftnoteasy mentioned offline that we could at least deduct the count against the overall number (the ANY request) for a given priority. /cc bikassaha
The AMRMClient was not written to automatically remove requests because it does not know which requests will be matched to allocated containers. The explicit contract is for users of AMRMClient to remove requests that have been matched to containers.
If we change that behavior to automatically remove requests then it may lead to issues where 2 entities are removing requests. 1) user 2) AMRMClient. So that change should only be made in a different version of AMRMClient or else existing users will break.
In the worst case, if the AMRMClient (automatically) removes the wrong request then the application will hang because the RM will not provide it the container that is needed. Not automatically removing the request has the downside of getting additional containers that need to be released by the application. We chose excess containers over hanging for the original implementation.
Excess containers should happen rarely because the user controls when AMRMClient heartbeats to the RM and can do that after having removed all matched requests, so that the remote request table reflects the current state of outstanding requests. There may still be a race condition on the RM side that gives more containers. Excess containers can happen more often with AMRMClientAsync, because it heartbeats on a regular schedule and can send more requests than are really outstanding if the heartbeat goes out before the user has removed the matched requests.
I was almost going to log the same issue when I saw this thread (and also YARN-3020).
After reading all the discussions, and after reading the related code, I still believe this is a bug.
I understand what bikassaha has said: the AM-RM protocol is NOT a delta protocol, and currently the user (i.e., the ApplicationMaster) is responsible for calling removeContainerRequest() after receiving an allocation. But consider the following simple modification to the packaged distributedshell application:
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
@@ -805,6 +805,8 @@ public void onContainersAllocated(List<Container> allocatedContainers) {
       // as all containers may not be allocated at one go.
       launchThreads.add(launchThread);
       launchThread.start();
+      ContainerRequest containerAsk = setupContainerAskForRM();
+      amRMClient.removeContainerRequest(containerAsk);
     }
   }
The code simply removes a container request after successfully receiving an allocated container in the ApplicationMaster. When you submit this application by specifying, say, 3 containers on the CLI, you will sometimes get 4 containers allocated (not counting the AM container)!
root@node2:~# hadoop org.apache.hadoop.yarn.applications.distributedshell.Client -jar /usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.0.0-SNAPSHOT.jar -shell_command "sleep 100000" -num_containers 3 -timeout 200000000
root@node2:~# yarn container -list appattempt_1431531743796_0015_000001
15/05/15 20:49:01 INFO client.RMProxy: Connecting to ResourceManager at node2/10.211.55.102:8032
15/05/15 20:49:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Total number of containers :5
Container-Id  Start Time  Finish Time  State  Host  Node Http Address  LOG-URL
container_1431531743796_0015_01_000005  Fri May 15 20:44:12 +0000 2015  N/A  RUNNING  node3:50093  http://node3:8042  http://node3:8042/node/containerlogs/container_1431531743796_0015_01_000005/root
container_1431531743796_0015_01_000001  Fri May 15 20:44:06 +0000 2015  N/A  RUNNING  node3:50093  http://node3:8042  http://node3:8042/node/containerlogs/container_1431531743796_0015_01_000001/root
container_1431531743796_0015_01_000002  Fri May 15 20:44:10 +0000 2015  N/A  RUNNING  node3:50093  http://node3:8042  http://node3:8042/node/containerlogs/container_1431531743796_0015_01_000002/root
container_1431531743796_0015_01_000004  Fri May 15 20:44:11 +0000 2015  N/A  RUNNING  node3:50093  http://node3:8042  http://node3:8042/node/containerlogs/container_1431531743796_0015_01_000004/root
container_1431531743796_0015_01_000003  Fri May 15 20:44:10 +0000 2015  N/A  RUNNING  node4:41128  http://node4:8042  http://node4:8042/node/containerlogs/container_1431531743796_0015_01_000003/root
The fundamental problem here, I believe, is that the AMRMClient maintains an internal request table, remoteRequestsTable, that keeps track of total container requests (i.e., both container requests that have been satisfied and those that are not yet satisfied):
protected final Map<Priority, Map<String, TreeMap<Resource, ResourceRequestInfo>>> remoteRequestsTable = new TreeMap<Priority, Map<String, TreeMap<Resource, ResourceRequestInfo>>>();
However, the corresponding requests table at the scheduler side (inside AppSchedulingInfo.java) keeps track of only outstanding container requests (i.e., container requests that are not yet satisfied):
final Map<Priority, Map<String, ResourceRequest>> requests = new ConcurrentHashMap<Priority, Map<String, ResourceRequest>>();
Every time an allocation is successfully made, the decResourceRequest() or decrementOutstanding() call will update the requests table so that it only contains outstanding requests, but unfortunately, every time an ApplicationMaster heartbeat comes, the same requests table is updated by the updateResourceRequests() call with the total requests coming from AMRMClient.
This inconsistent view (total requests on the AMRMClient side versus outstanding requests on the scheduler side) is, in my opinion, very confusing to say the least.
I see that a solution has already been proposed by wangda in YARN-3020, which I think is the correct thing to do:
maybe we should add a default implementation to deduct pending resource requests by prioirty/resource-name/capacity of allocated containers automatically (User can disable this default behavior, implement their own logic to deduct pending resource requests.)
This solution will make remoteRequestsTable in AMRMClient only keep track of outstanding container requests, which is then consistent with the requests table at the Scheduler side.
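A rough sketch of such a default deduction (hypothetical and simplified; the real remoteRequestsTable is keyed by priority, resource name, and capability):

import java.util.HashMap;
import java.util.Map;

class AutoDeductSketch {
  private final Map<String, Integer> remoteRequests = new HashMap<>();
  private boolean autoDeduct = true; // user could disable and deduct manually

  void addContainerRequest(String key) {
    remoteRequests.merge(key, 1, Integer::sum);
  }

  // Invoked for each allocated container before the next heartbeat.
  void onContainerAllocated(String key) {
    if (autoDeduct) {
      // Decrement the pending count; drop the entry when it reaches zero.
      remoteRequests.computeIfPresent(key, (k, v) -> v > 1 ? v - 1 : null);
    }
  }
}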
Any comments or thoughts? We are currently investigating YARN-1197, and are faced with a similar issue with properly tracking container resource increase requests at both client and server side.
Thanks,
Meng
stau-tudelft, I assigned this JIRA to you as you've contributed the patch. However, the patch doesn't apply any more. Moreover, would you please take a look at Bikas' latest comment? It seems that the API is not designed for sending the delta: after a container is allocated, you need to remove the request explicitly. Back to your example: once your 1st container is allocated, you need to remove one request before the next allocate call, to make sure you're going to ask for 11 more containers, not 12.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12638931/YARN-1902.v3.patch
against trunk revision 3ca5bd1.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7090//console
This message is automatically generated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12638931/YARN-1902.v3.patch
against trunk revision 1556f86.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5951//console
This message is automatically generated.
If all the requests are for the same location (say star) then the client needs to send all outstanding requests to the RM. The AM-RM protocol is not a delta protocol. Sending deltas will not work because the RM expects all requests at a given location to be presented in every update: it simply sets that value from the request in its internal book-keeping objects (versus performing an increment for the "new" request).
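The set-not-increment semantics, as a minimal illustration (not actual scheduler code):

import java.util.HashMap;
import java.util.Map;

class SetSemanticsSketch {
  // RM-side book-keeping: location -> outstanding container count.
  final Map<String, Integer> bookKeeping = new HashMap<>();

  // Each heartbeat must carry the full outstanding ask per location;
  // the RM overwrites its previous value rather than adding to it.
  void onHeartbeatAsk(String location, int numContainers) {
    bookKeeping.put(location, numContainers); // set, never increment
  }

  public static void main(String[] args) {
    SetSemanticsSketch rm = new SetSemanticsSketch();
    rm.onHeartbeatAsk("*", 12); // 12 outstanding
    // Sending "1" as a delta would be wrong: the RM now believes only
    // 1 container is outstanding at this location.
    rm.onHeartbeatAsk("*", 1);
    System.out.println(rm.bookKeeping.get("*")); // prints 1, not 13
  }
}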
In the scenarios, a separate ContainerRequest object is passed with each addContainerRequest call.
The point is that the user does not want its previous requests to be undone, but to make an additional request while not yet having received all 12 containers.
Let's say for Scenario 1 that z = 12.
12 containers are requested; then, asynchronously, an allocated container is started; after this, 1 extra container is requested. However, instead of 1 extra container being requested, 13 containers will be requested. This, as far as I can see, does not necessarily involve the RM and could be solved in the client side's internal bookkeeping of requests, which is what the patch tries to show.
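For concreteness, a sketch of the client-side calls in this scenario (registration and configuration elided; the capability and priority values are illustrative):

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

class ScenarioOneSketch {
  static void reproduce(AMRMClient<ContainerRequest> client) throws Exception {
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    for (int i = 0; i < 12; i++) {
      client.addContainerRequest(
          new ContainerRequest(capability, null, null, priority));
    }
    client.allocate(0.0f); // sends an ask for 12 containers
    // ... a container is allocated and started asynchronously ...
    client.addContainerRequest(
        new ContainerRequest(capability, null, null, priority));
    client.allocate(0.1f); // intends +1, but the re-sent ask totals 13
  }
}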
Given a ContainerRequest x with Resource y: when addContainerRequest is called z times with x, allocate is called, and at least one of the z allocated containers is started, then if another addContainerRequest call is made and subsequently an allocate call is sent to the RM, (z+1) containers will be allocated, where only 1 container is expected.
Firstly, I am not sure that the same ContainerRequest object can be passed multiple times to addContainerRequest. It should be a different object each time (even if they point to the same resource). This might have something to do with the internal book-keeping done for matching requests.
Secondly, after z requests are made and 1 allocation is received, z-1 requests remain. If you are using AMRMClientImpl then it is your (the user's) responsibility to call removeContainerRequest() for the request that was matched to this container. The AMRMClient does not know which of your z requests could be assigned to this container, so in the general case it cannot automatically remove a request from the internal table. If the javadocs don't clarify these semantics then we can improve the javadocs.
Thirdly, the protocol between the AMRMClient and the RM has an inherent race. So if the client had earlier asked for z containers and in the next heartbeat reduces that to z-1, the RM may actually return z containers to it because it had already allocated them to this client before the client updated the RM with the new value.
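For reference, the contract from the second point, sketched as an allocate-response handler. getMatchingRequests is the AMRMClient helper for finding which stored requests a container could satisfy; matching on the host alone is a simplification, and a real handler may also need to consult rack and ANY entries:

import java.util.Collection;
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

class RemoveOnMatchSketch {
  static void onContainersAllocated(AMRMClient<ContainerRequest> client,
      List<Container> allocated) {
    for (Container container : allocated) {
      // Find stored requests this container could have been matched to.
      List<? extends Collection<ContainerRequest>> matches =
          client.getMatchingRequests(container.getPriority(),
              container.getNodeId().getHost(), container.getResource());
      if (!matches.isEmpty() && !matches.get(0).isEmpty()) {
        // Remove exactly one matched request so the remote request table
        // stays consistent with what is still outstanding.
        client.removeContainerRequest(matches.get(0).iterator().next());
      }
    }
  }
}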
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12638931/YARN-1902.v3.patch
against trunk revision 7aa667e.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client:
org.apache.hadoop.yarn.client.TestResourceTrackerOnHA
org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA
org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5083//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5083//console
This message is automatically generated.
Is this the right way to approach this issue or should there be a more thorough analysis of the current implementation?
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12638931/YARN-1902.v3.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3520//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3520//console
This message is automatically generated.
Moved the administration of resource requests into the critical section before the allocate call. This is better because the asks will be retried anyway, and calls to addContainerRequest right after allocate won't be added to the previous request.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12638928/YARN-1902.v2.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3519//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3519//console
This message is automatically generated.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12638692/YARN-1902.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3517//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3517//console
This message is automatically generated.
Oh, and Hadoop 2.7 can also be added to the affected version list.