Under concurrent collection creation, wrong Autoscaling placement decisions can lead to severely unbalanced clusters.
Sequential creation of the same collections is handled correctly and the cluster is balanced.
TL;DR; under high load, the way sessions that cache future changes to Zookeeper are managed cause placement decisions of multiple concurrent Collection API calls to ignore each other, be based on identical “initial” cluster state, possibly leading to identical placement decisions and as a consequence cluster imbalance.
Some context first for those less familiar with how Autoscaling deals with cluster state change: a PolicyHelper.Session is created with a snapshot of the Zookeeper cluster state and is used to track already decided but not yet persisted to Zookeeper cluster state changes so that Collection API commands can make the right placement decisions.
A Collection API command either uses an existing cached Session (that includes changes computed by previous command(s)) or creates a new Session initialized from the Zookeeper cluster state (i.e. with only state changes already persisted).
When a Collection API command requires a Session - and one is needed for any cluster state update computation - if one exists but is currently in use, the command can wait up to 10 seconds. If the session becomes available, it is reused. Otherwise, a new one is created.
The Session lifecycle is as follows: it is created in COMPUTING state by a Collection API command and is initialized with a snapshot of cluster state from Zookeeper (does not require a Zookeeper read, this is running on Overseer that maintains a cache of cluster state). The command has exclusive access to the Session and can change the state of the Session. When the command is done changing the Session, the Session is “returned” and its state changes to EXECUTING while the command continues to run to persist the state to Zookeeper and interact with the nodes, but no longer interacts with the Session. Another command can then grab a Session in EXECUTING state, change its state to COMPUTING to compute new changes taking into account previous changes. When all commands having used the session have completed their work, the session is “released” and destroyed (at this stage, Zookeeper contains all the state changes that were computed using that Session).
The issue arises when multiple Collection API commands are executed at once. A first Session is created and commands start using it one by one. In a simple 1 shard 1 replica collection creation test run with 100 parallel Collection API requests (see debug logs from PolicyHelper in file policy.logs), this Session update phase (Session in COMPUTING status in SessionWrapper) takes about 250-300ms (MacBook Pro).
This means that about 40 commands can run by using in turn the same Session (45 in the sample run). The commands that have been waiting for too long time out after 10 seconds, more or less all at the same time (at the rate at which they have been received by the OverseerCollectionMessageHandler, approx one per 100ms in the sample run) and most/all independently decide to create a new Session. These new Sessions are based on Zookeeper state, they might or might not include some of the changes from the first 40 commands (depending on if these commands got their changes written to Zookeeper by the time of the 10 seconds timeout, a few might have made it, see below).
These new Sessions (54 sessions in addition to the initial one) are based on more or less the same state, so all remaining commands are making placement decisions that do not take into account each other (and likely not much of the first 44 placement decisions either).
The sample run whose relevant logs are attached led for the 100 single shard single replica collection creations to 82 collections on the Overseer node, and 5 and 13 collections on the two other nodes of a 3 nodes cluster. Given that the initial session was used 45 times (once initially then reused 44 times), one would have expected at least the first 45 collections to be evenly distributed, i.e. 15 replicas on each node. This was not the case, possibly a sign of other issues (other runs even ended up placing 0 replicas out of the 100 on one of the nodes).
From the client perspective, http admin collection CREATE requests averaged 19.5 seconds each and lasted between 7 and 28 seconds (100 parallel threads). This is likely an indication that the last 55 collection creations didn’t see much of the state updates done by the first 45 creations (client delay is longer though than actual Overseer command execution time by http time + Collections API Zookeeper queue time) .
A possible fix is to not observe any delay before creating a new Session when the currently cached session is busy (i.e. COMPUTING). It will be somewhat less optimal in low load cases (this is likely not an issue, future creations will compensate for slight unbalance and under optimal placement) but will speed up Collection API calls (no waiting) and will prevent multiple waiting commands from all creating new Sessions based on an identical Zookeeper state in cases such as the one described here. For long (minutes and more) autoscaling computations it will likely not make a big difference.
If we had more than a single Session being cached (and reused), then less ongoing updates would be lost.
Maybe, rather than caching the new updated cluster state after each change, the changes themselves (the deltas) should be tracked. This might allow to propagate changes between sessions or to reconcile cluster state read from Zookeeper with the stream of changes stored in a Session by identifying which deltas made it to Zookeeper, which ones are new from Zookeeper (originating from an update in another session) and which are still pending.