Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
1.18.0
-
None
-
None
Description
Problem
When the initial job requests many containers from yarn, it is easy to get more than needed containers for that the YARN AM-RM protocol is not a delta protocol (please see YARN-1902). For example, we are needing 3000 containers. Consider the following case.
Case one:
- The job requests 2000 containers firstly and then the yarn client has 2000 requests.
- The yarn heartbeat happens and the yarn client request 2000 containers to yarn rm.
- The job requests another 1000 containers and the the yarn client has 3000 requests.
- The yarn heartbeat happens and the yarn client request 3000 containers to yarn rm.
- On heartbeat finish, yarn rm returns 2000 containers. After the callback the method onContainersAllocated and removeContainerRequest, yarn client has 1000 requests.
- The yarn heartbeat happens and the yarn client request 1000 containers to yarn rm.
- On heartbeat finish, yarn rm returns 3000 containers. After the callback the method onContainersAllocated and removeContainerRequest, yarn client has 0 requests.
- The yarn heartbeat happens.
- On heartbeat finish, yarn rm returns 1000 containers which are excess since the last client request number is 1000.
In the end, the yarn may allocate 2000 + 3000 + 1000 = 6000 containers. But we only need 3000 containers and should return 3000 containers.
Case two:
- The job requests 3000 containers firstly and the the yarn client has 3000 requests.
- The yarn heartbeat happens and the yarn client request 3000 containers to yarn rm.
- On heartbeat finish, yarn rm returns 1000 containers(2000 allocating). After the callback the method onContainersAllocated and removeContainerRequest, yarn client has 2000 requests.
- The yarn heartbeat happens and the yarn client request 2000 containers to yarn rm.
- On heartbeat finish, yarn rm returns 2000 containers. After the callback the method onContainersAllocated and removeContainerRequest, yarn client has 0 requests.
- The yarn heartbeat happens.
- On heartbeat finish, yarn rm returns 2000 containers which are excess since the last client request number is 2000.
In the end, the yarn may allocate 1000 + 2000 + 2000 = 5000 containers. But we only need 3000 containers and should return 2000 containers.
The reason is that any update to the yarn client's requests may produce undesired behavior.
Solution
In our inner flink version, we use two ways to resolve the problem as following:
- Compute the total resource requests at start and request by batch to avoid being interrupted by yarn heartbeat. Note that we loop resourceManagerClient.addContainerRequest(containerRequest)) to simulate batch-request quickly.
- Remove the yarn client's container requests after receiving enough resources to avoid request update.