On Elastic's CI we found an innocent-looking test hanging in IW#forceMerge.
After spending quite some time trying to reproduce it without any luck, I reviewed all the involved code again to look for possible threading issues. What I found is that if maxMergeCount is changed on the CMS while merges are running, and a forceMerge is kicked off at the same time the running merges return, we can fail to pick up the final pending merges, which causes the forceMerge to hang. I was able to build a test case that is very likely to fail on every run without the fix. Judging by how likely it is to happen in practice, I don't think this is a critical bug, but when it does happen it is effectively a deadlock unless the IW sees some other change that kicks off a merge.
Let me walk through the issue. Let's say we have 1 pending merge and 2 merge threads running on the CMS. The forceMerge is already waiting for merges to finish. Once the first merge thread finishes, we check whether it needs to be stalled, but since it is a merge thread we return right away and do not pick up another merge.
Now the second running merge thread checks the same condition while the first one is still finishing up. Before the first thread can actually update the internal data structures, it releases the CMS lock, so the calculation in the stall method of how many threads are running is off. As a result the second thread also steps out of the maybeStall method without picking up the pending merge.
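To make the interleaving easier to follow, here is a toy, single-threaded model of it in plain Java. It is not Lucene code: the class `MergeStallSketch`, its methods, the thread names `t1`/`t2`, the merge name `m3`, and the concrete limits (3 lowered to 2) are all made up for illustration. It also assumes that a finishing merge thread is still counted as running until it deregisters itself, which is how I read the behavior described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Toy model of the stall check; hypothetical names, not the real CMS implementation. */
final class MergeStallSketch {
  private final Deque<String> pendingMerges = new ArrayDeque<>();
  private final Set<String> mergeThreads = new HashSet<>();
  private int maxMergeCount;

  MergeStallSketch(int maxMergeCount) { this.maxMergeCount = maxMergeCount; }

  void setMaxMergeCount(int maxMergeCount) { this.maxMergeCount = maxMergeCount; }
  void addPendingMerge(String merge) { pendingMerges.add(merge); }
  void startMergeThread(String thread) { mergeThreads.add(thread); }
  void removeMergeThread(String thread) { mergeThreads.remove(thread); }
  int pendingMergeCount() { return pendingMerges.size(); }
  int runningThreadCount() { return mergeThreads.size(); }

  /** Returns false when the caller must step out without picking up a merge. */
  boolean maybeStall(String caller) {
    if (!pendingMerges.isEmpty() && mergeThreads.size() >= maxMergeCount) {
      if (mergeThreads.contains(caller)) {
        return false; // a merge thread is never stalled, it just returns
      }
      // A non-merge caller would wait here; this sketch never exercises that path.
      throw new IllegalStateException("non-merge callers are not modeled");
    }
    return true;
  }

  /** A finishing merge thread looks for follow-up work; true if it took a pending merge. */
  boolean finishMerge(String thread) {
    if (maybeStall(thread) && !pendingMerges.isEmpty()) {
      pendingMerges.poll(); // stands in for running the merge
      return true;
    }
    return false;
  }

  /** Replays the scenario; the flag controls whether t1 deregisters before t2 checks. */
  static MergeStallSketch runScenario(boolean t1DeregistersBeforeT2Checks) {
    MergeStallSketch cms = new MergeStallSketch(3);
    cms.startMergeThread("t1");
    cms.startMergeThread("t2");
    cms.addPendingMerge("m3");
    cms.setMaxMergeCount(2); // limit lowered while merges are running (assumed values)

    cms.finishMerge("t1"); // 2 running >= 2, so t1 steps out
    if (t1DeregistersBeforeT2Checks) {
      cms.removeMergeThread("t1");
    }
    cms.finishMerge("t2"); // sees 2 running (stale) or 1 running (up to date)
    cms.removeMergeThread("t1");
    cms.removeMergeThread("t2");
    return cms;
  }

  public static void main(String[] args) {
    MergeStallSketch racy = runScenario(false);
    System.out.println("stale count:  pending=" + racy.pendingMergeCount()
        + " running=" + racy.runningThreadCount());
    MergeStallSketch ordered = runScenario(true);
    System.out.println("fresh count:  pending=" + ordered.pendingMergeCount()
        + " running=" + ordered.runningThreadCount());
  }
}
```

In the stale-count run both threads step out and one merge stays pending with no merge thread left to take it, which is the state a waiting forceMerge would never get out of. In the other run the second thread sees an up-to-date count and picks the merge up.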