Description
We observed unexpected behaviour where a node was taking traffic for a replica that was not ready to serve it. This seems to happen when the node is marked as live and the replica is marked as active, while the corresponding core has not yet been loaded on the node.
I looked at the code and, in theory, this should not happen, because ZkController#init does the following: mark the node as down, wait for its replicas to be marked as down, and then register the node as live. However, after looking at the code of publishAndWaitForDownStates, I observed that it only waits for down states for the replicas associated with the cores returned by CoreContainer#getCoreDescriptors. That list is empty at this point, since ZkController#init is called before cores are discovered (which happens later, in CoreContainer#load).
It therefore seems to me that we never actually wait for any replicas to be marked as down. The startup sequence simply continues by marking the node as live, so the node might take traffic for a short period of time for a replica that is not ready (e.g., if the node previously crashed and the replica stayed active).
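To make the suspected ordering problem concrete, here is a minimal sketch in plain Java. It is not Solr code: the class StartupOrderSketch, its methods, and the core names are hypothetical stand-ins that only mimic the order of calls described above (a wait step that iterates over the known core descriptors, and a discovery step that populates them).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the startup sequence; this is NOT Solr's actual code.
class StartupOrderSketch {
    // Cores known to the container; empty until discovery has run.
    private final List<String> coreDescriptors = new ArrayList<>();

    // Stand-in for core discovery (described above as happening in CoreContainer#load).
    void discoverCores() {
        coreDescriptors.add("core1");
        coreDescriptors.add("core2");
    }

    // Stand-in for publishAndWaitForDownStates: returns the cores whose
    // replicas it would wait on, based on what is known at call time.
    List<String> publishAndWaitForDownStates() {
        return new ArrayList<>(coreDescriptors);
    }

    // Runs the two steps in the requested order and reports what was waited on.
    static List<String> replicasWaitedFor(boolean discoverCoresFirst) {
        StartupOrderSketch node = new StartupOrderSketch();
        if (discoverCoresFirst) {
            node.discoverCores();
        }
        List<String> waited = node.publishAndWaitForDownStates();
        if (!discoverCoresFirst) {
            node.discoverCores(); // too late: the wait step has already run
        }
        return waited;
    }

    public static void main(String[] args) {
        System.out.println("wait before discovery: " + replicasWaitedFor(false));
        System.out.println("wait after discovery:  " + replicasWaitedFor(true));
    }
}
```

In this sketch, running the wait step before discovery yields an empty list, which corresponds to the situation described above where no replica is ever waited on before the node is registered as live.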