Collection API and ConfigSet API messages can be processed out of order by the OverseerTaskProcessor/OverseerCollectionConfigSetProcessor because of the way locking is handled.
The Overseer task processor dequeues messages from the Collection and ConfigSet Zookeeper queue at /overseer/collection-queue-work and hands processing to an executor pool with 100 max parallel tasks (ClusterStateUpdater on the other hand uses a single thread consuming and processing the /overseer/queue and does not suffer from the problem described here).
Locking prevents tasks modifying the same or related data from running concurrently. For the Collection API, locking can be done at CLUSTER, COLLECTION, SHARD or REPLICA level and each command has its locking level defined in CollectionParams.CollectionAction.
Multiple tasks for the same or competing locking levels do not execute concurrently. Commands locking at COLLECTION level for the same collection (for example CREATE creating a collection, CREATEALIAS creating an alias for the collection and CREATESHARD creating a new shard for the collection) will be executed sequentially, and it is expected that they are executed in submission order. Commands locking at different levels are also serialized when needed (for example changes to a shard of a collection and changes to the collection itself).
The way OverseerTaskProcessor.run() deals with messages that cannot be executed due to competing messages already running (i.e. lock unavailable) is not correct. The method basically iterates over the set of messages to execute, skips those that can’t be executed due to locking and executes those that can be (assuming enough remaining available executors).
The issue of out of order execution occurs when at least 3 messages compete for a lock. The following scenario shows out of order execution (messages numbered in enqueue order):
- Message 1 gets the lock and runs,
- Message 2 can’t be executed because the lock is taken,
- Message 1 finishes execution and releases the lock,
- Message 3 can be executed, gets the lock and runs,
- Message 2 eventually runs when Message 3 finishes and releases the lock.
We therefore have execution order 1 - 3 - 2 when the expected execution order is 1 - 2 - 3. Order 1 - 3 - 2 might not make sense or result in a different final state from 1 - 2 - 3.
(there’s a variant of this scenario in which after Message 2 can't be executed the number of max parallel tasks is reached, then remaining tasks in heads including Message 3 are put into the blockedTasks map. These tasks are the first ones considered for processing on the next iteration of the main while loop. The reordering is similar).
Note that from the client perspective, there does not need to be 3 outstanding tasks, 2 are sufficient. The third task required for observing reordering could be enqueued after the first one completed.
Impact of the issue:
This problem makes MultiThreadedOCPTest.testFillWorkQueue() fail because the test requires task execution in submission order (
The messages required for reordering to happen are not necessarily messages at the same locking level: a message locking at SHARD or REPLICA level prevents execution of a COLLECTION level message for the same collection. Examples can be built of sequences of messages that lead to incorrect results or failures. For example ADDREPLICA followed by CREATEALIAS followed by DELETEALIAS, could see alias creation and deletion reordered, making no sense.
I wasn’t able to come up with a realistic production example that would be impacted by this issue (doesn't mean one doesn't exist!).