Here is a sequence of 2 disconnects and reconnects.
In the first disconnect the sequence is: get disconnect watcher, execute disconnect code, execute connect code.
In the second disconnect the sequence is: get disconnect watcher, execute connect code, execute disconnect code.
In the second sequence of events, if the JVM has leader replicas then all updates start failing with "Cannot talk to ZooKeeper - Updates are disabled." This starts happening exactly after 27 seconds (the zk client timeout is 30s, and 90% of 30 = 27, which is when the code thinks the session is likely expired). There are no leadership changes, since there was no session expiry. Unless you restart the node, all updates to the system continue to fail.
These log lines are from Solr 5.3, where the WatchedEvent was still being logged at INFO.
Based on the log ordering, we process the connect code and then the disconnect code, i.e. out of order. The connection is active but the flag is not set, and hence after 27 seconds zkCheck starts complaining that the connection is likely expired.
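To make the failure mode concrete, here is a minimal sketch of a connected flag combined with a 90%-of-timeout check. It is a simplified, hypothetical model and not Solr's actual ConnectionManager code: the class name LikelyExpiredSketch, its methods, and the explicit clock parameter are assumptions made for illustration.

```java
// Hypothetical, simplified model of the "likely expired" check.
// It is NOT Solr's ConnectionManager; names and structure are assumptions.
final class LikelyExpiredSketch {
    private final long clientTimeoutMs;
    private boolean connected = true;
    private long disconnectedAtMs;

    LikelyExpiredSketch(long clientTimeoutMs) {
        this.clientTimeoutMs = clientTimeoutMs;
    }

    // Handler for the "connected" event: marks the connection as usable.
    void connected() {
        connected = true;
    }

    // Handler for the "disconnected" event: clears the flag and records when.
    void disconnected(long nowMs) {
        connected = false;
        disconnectedAtMs = nowMs;
    }

    // 90% of the client timeout, e.g. 27000 ms for a 30000 ms timeout.
    long thresholdMs() {
        return clientTimeoutMs * 9 / 10;
    }

    // True once the flag has read "disconnected" for longer than the threshold.
    boolean isLikelyExpired(long nowMs) {
        if (connected) {
            return false;
        }
        return nowMs - disconnectedAtMs > thresholdMs();
    }
}
```

With a 30000 ms timeout, calling connected() and then disconnected(0) (the out-of-order case) leaves the flag cleared, so isLikelyExpired starts returning true just after 27000 ms even though the underlying connection is healthy. Calling disconnected(0) and then connected() (the in-order case) never trips the check.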
A related Jira is
ZK gives us ordered watch events (https://zookeeper.apache.org/doc/r3.4.8/zookeeperProgrammers.html#sc_WatchGuarantees), but from what I understand Solr can still process them out of order. We could take a lock and synchronize ConnectionManager#connected and ConnectionManager#disconnected.
Would that be the right approach to take?
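A rough sketch of the locking idea, to make the question concrete. This is a hypothetical stand-in and not a patch against Solr: the class SynchronizedConnectionState, its State enum, and the process method are invented for illustration. Note the assumption it rests on: the lock makes each event's handling atomic, so two handlers cannot interleave, but it only preserves ZK's ordering if process is entered in the order the events were delivered. If handling is handed off to other threads before the lock is taken, the events can still be applied out of order.

```java
// Hypothetical sketch of serializing connection-state changes behind one lock.
// Not Solr code; all names here are assumptions for illustration.
final class SynchronizedConnectionState {
    // Simplified stand-in for the ZooKeeper keeper states of interest.
    enum State { SYNC_CONNECTED, DISCONNECTED }

    private final Object lock = new Object();
    private boolean connected;

    // Single entry point: the decision and the flag update happen under the lock,
    // so a connect and a disconnect can never be applied half-and-half.
    void process(State state) {
        synchronized (lock) {
            if (state == State.SYNC_CONNECTED) {
                connected = true;
            } else if (state == State.DISCONNECTED) {
                connected = false;
            }
        }
    }

    // Reads take the same lock so they observe the latest applied event.
    boolean isConnected() {
        synchronized (lock) {
            return connected;
        }
    }
}
```

If process is called directly from the single thread that delivers watch events, a DISCONNECTED followed by a SYNC_CONNECTED always ends with isConnected() returning true.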