Sometimes, after our topologies have been running for a while, Zookeeper does not respond in time, and we see:
2017-08-16 10:18:38.859 o.a.s.zookeeper [INFO] ip-10-181-20-70.ec2.internal lost leadership.
2017-08-16 10:21:31.144 o.a.s.zookeeper [INFO] ip-10-181-20-70.ec2.internal gained leadership, checking if it has all the topology code locally.
2017-08-16 10:21:46.201 o.a.s.zookeeper [INFO] Accepting leadership, all active topology found localy.
That in itself is fine; we probably just need to allocate more resources. But after a new leader is chosen, we then see:
o.a.s.b.BlobStoreUtils [ERROR] Could not update the blob with key<key>
over and over.
I haven't yet figured out how to trigger the conditions that make Zookeeper unresponsive, but the BlobStoreUtils error can be reproduced by restarting Zookeeper.
The problem, I think, is that the loop here never executes because the nimbusInfos list is empty. If I add a check similar to this for a node which exists but has no children, the error goes away.
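To illustrate the suspected failure mode, here is a minimal, self-contained sketch. It is not the actual Storm code: the class BlobUpdateSketch, both methods, and the use of plain host strings in place of nimbusInfos entries are hypothetical stand-ins, and treating an empty list as a no-op is my assumption about what the guard should do.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not Storm's BlobStoreUtils.
class BlobUpdateSketch {

    // Mirrors the suspected shape of the update loop: try each known
    // Nimbus host until one download succeeds. With an empty list the
    // loop body never runs, so the result is always "not updated".
    static boolean updateFromAny(List<String> nimbusHosts, Predicate<String> tryDownload) {
        boolean updated = false;
        for (String host : nimbusHosts) {
            if (tryDownload.test(host)) {
                updated = true;
                break;
            }
        }
        return updated;
    }

    // Same loop with a guard in front. Assumption: a key node that
    // exists but has no children means there is nothing to update, so
    // this is reported as success instead of falling through to the error.
    static boolean updateWithGuard(List<String> nimbusHosts, Predicate<String> tryDownload) {
        if (nimbusHosts == null || nimbusHosts.isEmpty()) {
            return true;
        }
        return updateFromAny(nimbusHosts, tryDownload);
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        List<String> one = Arrays.asList("nimbus-1"); // made-up host name

        System.out.println("unguarded, empty list: " + updateFromAny(none, h -> true));
        System.out.println("guarded, empty list:   " + updateWithGuard(none, h -> true));
        System.out.println("guarded, one host:     " + updateWithGuard(one, h -> true));
    }
}
```

In the unguarded version, an empty list is indistinguishable from "every download failed", which would explain why the error is logged on every attempt until the list is populated again.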