Affects Version/s: 3.2.0, 3.1.1
Fix Version/s: None
Component/s: capacity scheduler
We are using placement constraints (anti-affinity) in an application together with node labels. The application requests two containers with anti-affinity, targeting a node label that contains only two nodes.
So the two containers are allocated across the two nodes, one on each node, satisfying anti-affinity.
When one NodeManager goes down for some time, the RM marks the node as LOST and kills all containers running on it.
The AM now has one pending container request, since its previous container was killed.
When the NodeManager comes back up after some time, the pending container is never allocated on that node again, and the application waits forever for it.
If the ResourceManager is restarted, the issue disappears and the container is allocated on the NodeManager that recently came back up.
This appears to be caused by allocation tags not being removed.
The allocation tag is added for container container_e68_1595886973474_0005_01_000003.
However, the allocation tag is not removed when container container_e68_1595886973474_0005_01_000003 is released: there is no corresponding DEBUG message for tag removal, which indicates the tags are never removed. As long as the stale tag remains, the scheduler will not allocate on that node due to anti-affinity, producing the behavior observed.
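To make the failure mode concrete, here is a minimal sketch of the tag lifecycle described above. It is an illustration only, not the actual YARN AllocationTagsManager: the class and method names (TagTracker, onAllocate, onNmConfirmedRelease, canPlace) are hypothetical, but the logic mirrors the reported behavior, where tag removal is deferred until NM confirmation and anti-affinity placement is gated on the tag's presence.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of allocation-tag bookkeeping; NOT the real
// YARN AllocationTagsManager.
class TagTracker {
    private final Map<String, Set<String>> tagsByNode = new HashMap<>();

    // A tag is recorded when a container is allocated on a node.
    void onAllocate(String node, String tag) {
        tagsByNode.computeIfAbsent(node, n -> new HashSet<>()).add(tag);
    }

    // Post-YARN-8511 behavior: the tag is removed only once the NM
    // confirms the container is gone. If that confirmation never arrives
    // (e.g. the NM was marked LOST), this is never called and the tag leaks.
    void onNmConfirmedRelease(String node, String tag) {
        Set<String> tags = tagsByNode.get(node);
        if (tags != null) {
            tags.remove(tag);
        }
    }

    // Anti-affinity: a container carrying this tag may only be placed on a
    // node that currently holds no container with the same tag.
    boolean canPlace(String node, String tag) {
        return !tagsByNode.getOrDefault(node, new HashSet<>()).contains(tag);
    }
}

public class TagLeakDemo {
    public static void main(String[] args) {
        TagTracker tracker = new TagTracker();
        tracker.onAllocate("node1", "app");
        tracker.onAllocate("node2", "app");

        // node2's NM goes LOST; the RM kills the container, but the NM
        // never confirms the release, so onNmConfirmedRelease never runs.
        // When node2 rejoins, the stale tag still blocks placement:
        System.out.println(tracker.canPlace("node2", "app")); // false

        // An RM restart rebuilds the tag store from scratch; modeled here
        // as the removal finally taking effect:
        tracker.onNmConfirmedRelease("node2", "app");
        System.out.println(tracker.canPlace("node2", "app")); // true
    }
}
```

In this model, the pending request starves exactly as in the report: the only label-eligible node still carries the dead container's tag, so the anti-affinity check rejects it until the tag store is rebuilt.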
This appears to stem from the change made in YARN-8511, which removes tags only after the NM confirms the container has been released. In our scenario that confirmation never happens, so the tag is never removed until the RM is restarted.
Reverting YARN-8511 fixes this particular issue and the tags are removed. But that is not a valid solution, since the problem YARN-8511 solves is also real. We need a fix that resolves this issue without breaking YARN-8511.