No recovery will be started if a node cannot talk to ZooKeeper.
Well, I knew that. I meant that the two Solrs were disconnected from ZK at the same time, but of course both got their connection reestablished - after the session timeout (I believe (and kinda hope) that a session timeout has to have occurred before Solr needs to go into recovery after a ZK connection loss).
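For reference, and if I read the config right, the timeout in play here is the ZK session timeout that SolrCloud configures via zkClientTimeout, e.g. in solr.xml (value purely illustrative):

    <cores adminPath="/admin/cores" defaultCoreName="collection1"
           zkClientTimeout="${zkClientTimeout:15000}">
      <core name="collection1" instanceDir="." />
    </cores>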
When it gets prioritized on my side, I will try to investigate further what causes the log to claim that many recoveries are running concurrently for the same shard.
When a new node is elected as a leader by ZooKeeper, it first tries to do a peer sync against every other live node. So let's say the first node in your two-node situation comes back and it is behind the other node, but it comes back first and is elected leader. The second node has the latest updates, but is second in line to be leader and a few updates ahead. The potential leader will try to peer sync with the other node and get those missing updates if there are fewer than 100 of them, or fail because the other node is ahead by too much.
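If I read that description right, the decision boils down to something like the following sketch (plain Java; the names and the version-list representation are my own invention, not Solr's actual API):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class PeerSyncSketch {
        // The "fewer than 100" window mentioned above.
        static final int MAX_MISSING_UPDATES = 100;

        /**
         * Decide whether the candidate leader can catch up from a replica via
         * peer sync, given the recent update versions each side has seen.
         */
        static boolean canPeerSync(List<Long> myVersions, List<Long> replicaVersions) {
            Set<Long> mine = new HashSet<>(myVersions);
            long missing = replicaVersions.stream()
                    .filter(v -> !mine.contains(v))
                    .count();
            // Too far behind: peer sync fails, and a full replication
            // (file copy) would be needed instead.
            return missing < MAX_MISSING_UPDATES;
        }
    }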
Well, we shouldn't let this issue (SOLR-3721) become about many other issues, but: when the "behind" node has reconnected and become leader, and the one with the latest updates does not come back live right away, isn't the new leader (which is behind) allowed to start handling update requests? If yes, then it is possible that both shards end up with documents/updates that the other one doesn't have, and one can come up with scenarios where there is no good algorithm for generating the "correct" merged union of the data in both shards. So what to do when the other shard (which used to have a later version than the current leader) comes live? I believe there is nothing solid to do!
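To make the conflict concrete, here is a toy illustration (plain Java, not Solr code) of the split I have in mind - each side holds an acknowledged update the other never saw, and picking either one silently drops the other:

    import java.util.HashMap;
    import java.util.Map;

    public class DivergenceSketch {
        public static void main(String[] args) {
            Map<String, String> behindLeader = new HashMap<>(); // reconnected first, became leader
            Map<String, String> aheadReplica = new HashMap<>(); // had the latest updates, came back later

            // Both start in sync.
            behindLeader.put("doc1", "v1");
            aheadReplica.put("doc1", "v1");

            // Update applied on the ahead node before the outage; the behind node never saw it.
            aheadReplica.put("doc1", "v2-from-old-leader");
            // Update accepted by the behind node after it became leader.
            behindLeader.put("doc1", "v2-from-new-leader");

            // No merge rule can keep both; either choice loses an acknowledged update.
            System.out.println("new leader: " + behindLeader + ", returning node: " + aheadReplica);
        }
    }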
How to avoid that? I was thinking about keeping the latest version for every slice in ZK, so that a "behind" shard will know whether it has the latest version of a slice, and therefore whether it is allowed to take the role as leader. Of course the writing of this "latest version" to ZK and the writing of the corresponding update to the leader's transaction-log would have to be atomic (like the A in ACID) as much as possible. And it would be nice if the writing of the update to the replica's transaction-log were also atomic with the leader write and the ZK write, in order to increase the chance that a replica is actually allowed to take over the leader role if the leader dies (or both die and the replica comes back first). But all that is just an idea off the top of my head.
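Roughly, with the plain ZooKeeper client API, the check could look like the sketch below. The znode path and the whole mechanism are my own invention, not something that exists in Solr today:

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class SliceVersionSketch {
        // Hypothetical znode holding the highest acknowledged update version per slice.
        static String path(String collection, String slice) {
            return "/collections/" + collection + "/sliceVersions/" + slice;
        }

        /** A "behind" shard checks whether its highest local version matches
         *  the slice's latest version in ZK before accepting leadership. */
        static boolean mayBecomeLeader(ZooKeeper zk, String collection, String slice,
                                       long myHighestVersion) throws Exception {
            byte[] data = zk.getData(path(collection, slice), false, new Stat());
            long latest = Long.parseLong(new String(data, StandardCharsets.UTF_8));
            return myHighestVersion >= latest;
        }

        /** The leader records each accepted update's version; making this atomic
         *  with the local transaction-log write is the hard part. */
        static void recordLatestVersion(ZooKeeper zk, String collection, String slice,
                                        long version, int expectedZnodeVersion) throws Exception {
            zk.setData(path(collection, slice),
                    Long.toString(version).getBytes(StandardCharsets.UTF_8),
                    expectedZnodeVersion); // conditional write guards against races
        }
    }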
Do you already have a solution implemented, or a solution on the drawing board, or how do you/we prevent such a problem? As far as I understand "the drill" during leader election/recovery (whether it's peer sync or file-copy replication), from the little code reading I have done and from what you explain, there is no current solution. But I might be wrong?
The other node, being the next in line to be leader, will now try and peer sync with the other nodes in the shard
Guess/hope you mean "...with the other shards (running on different nodes) in the slice". As I understand Solr terminology, a logical chunk of the "entire data" (a collection in Solr) is a "slice", and the data in a slice might physically exist in more than one place (in multiple shards, if replication is used). Back when I started taking an interest in Solr I spent a considerable amount of time understanding Solr terminology - mainly because it is different from what I have been used to (in my pre-Solr world a "shard" is what you call a "slice") - so now please don't tell me that I misunderstood.
That is the current process - we will continually be hardening and improving it I'm sure.
I will probably stick around for that. The correctness and robustness of this live-replication feature is (currently) very important to us.