Steps to reproduce:
- create a cluster of a few nodes (tested with 7 nodes)
- define per-collection policies that distribute replicas exclusively on different nodes per policy
- concurrently create a few collections, each using a different policy
- the resulting replica placement is badly skewed, causing many policy violations
Running the same scenario but instead creating collections sequentially results in no violations.
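Assuming a Solr version with the autoscaling framework, the reproduction can be sketched roughly as follows (the host, collection names, policy names, and parameter values are illustrative; adjust to your setup):

```shell
# Illustrative only: fire several CREATE calls in parallel, each bound to a
# different placement policy, then inspect the resulting placements for
# policy violations. Creating the same collections without '&'/'wait'
# (i.e. sequentially) produces no violations.
for i in 1 2 3; do
  curl -s "http://localhost:8983/solr/admin/collections?action=CREATE&name=coll$i&numShards=2&replicationFactor=3&policy=policy$i" &
done
wait
```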
I suspect this is caused by an insufficient locking level for all collection operations (as defined in CollectionParams.CollectionAction) that create new replica placements - i.e. CREATE, ADDREPLICA, MOVEREPLICA, DELETENODE, REPLACENODE, SPLITSHARD, RESTORE and REINDEXCOLLECTION. All of these operations use the policy engine to compute new replica placements, and as a result they change the cluster state. However, these operations are currently locked (in OverseerCollectionMessageHandler.lockTask) using LockLevel.COLLECTION. In practice this means that the lock is held only for the particular collection being modified, so two concurrent operations on different collections compute their placements from the same cluster-state snapshot, neither one seeing the placements the other is about to make.
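The race can be illustrated with a small, self-contained simulation (hypothetical names, not the actual Solr placement code): two placement computations that each read the same cluster-state snapshot both pick the currently emptiest node, co-locating replicas that the policy requires to be spread out, while sequential (or cluster-locked) computation picks distinct nodes.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

public class PlacementRace {
    // Return the node currently hosting the fewest replicas (ties broken by
    // node name, since TreeMap iterates in key order).
    static String pickEmptiestNode(Map<String, Integer> replicaCounts) {
        return Collections.min(replicaCounts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        Map<String, Integer> clusterState = new TreeMap<>();
        clusterState.put("node1", 0);
        clusterState.put("node2", 0);

        // "Concurrent" case: both CREATE operations read the same stale
        // snapshot, because each one holds only its own collection's lock.
        String a = pickEmptiestNode(new TreeMap<>(clusterState));
        String b = pickEmptiestNode(new TreeMap<>(clusterState));
        System.out.println("concurrent placements: " + a + ", " + b); // node1, node1

        // Sequential (or cluster-locked) case: the second computation sees
        // the first placement and therefore picks a different node.
        String first = pickEmptiestNode(clusterState);
        clusterState.merge(first, 1, Integer::sum);
        String second = pickEmptiestNode(clusterState);
        System.out.println("sequential placements: " + first + ", " + second); // node1, node2
    }
}
```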
A straightforward fix for this issue is to change the locking level to CLUSTER (and I confirm this fixes the scenario described above). However, this effectively serializes all of the collection operations listed above, which will slow down collection operations across the board.
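The trade-off between the two lock levels can be sketched as follows (hypothetical names, not the actual Solr lock implementation): COLLECTION-level locking hands out one lock per collection name, so operations on different collections run in parallel but their placement computations can race; CLUSTER-level locking hands out one shared lock, which is safe but serializes everything.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: models the difference between collection- and cluster-level
// locking, not Solr's actual LockTree/LockLevel machinery.
public class LockRegistry {
    private final ReentrantLock clusterLock = new ReentrantLock();
    private final Map<String, ReentrantLock> collectionLocks = new ConcurrentHashMap<>();

    public ReentrantLock lockFor(String collection, boolean clusterLevel) {
        if (clusterLevel) {
            // One lock for every operation: no placement races, but all
            // operations are serialized.
            return clusterLock;
        }
        // One lock per collection: concurrent operations on different
        // collections proceed in parallel, so placements can race.
        return collectionLocks.computeIfAbsent(collection, c -> new ReentrantLock());
    }
}
```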