The root cause of this bug was a race condition while using the ControllerBrokerRequestBatch that assumes synchronization at the caller. This patch synchronizes access to the ControllerBrokerRequestBatch while sending out the StopReplicaRequest.
While working on this and testing the patch, I noticed some inefficiency with the controller shutdown API. When we move the leader for a partition one at a time, we also remove the broker being shutdown from the ISR. This involves at least one read and one write per partition to zookeeper, sometimes more. Besides being slow, this operation is not effective. This is because the broker being shutdown still has alive fetchers to the new leader before it receives and acts on the StopReplicaRequest. So the new leader adds it back to the ISR anyways.
Then I thought I could move it to the end of the controlled shutdown API where we check if all partitions are successfully moved, then send StopReplicaRequest to the broker being shut down. Right after this, we could move the replica to Offline and remove it from ISR as part of that operation. Even if we can invoke this operation as a batch, it will still be slow since the zookeeper reads/writes will happen serially. Also, I realized this is not required as well for 2 reasons -
1. Since the shutting broker is sent the StopReplicaRequest, it will stop the fetcher and fall out of ISR. Until then, as long as the controller doesn't failover, it will not be elected as the leader, even if it is in the ISR, since it is one of the shuttingDownBrokers.
2. Even if the controller fails over, by the time the new controller has initialized and is ready to serve, the StopReplicaRequest would've ensured that the shutdown broker is no longer in the ISR. And until the controller fails over, there cannot be any leader election anyway.
I've tested this patch on a 7 node cluster that continuously gets rolling bounced and has ~100 producers sending production data to it with ~1000 consumers consuming data from it.