Description
Currently, during leader election, PeerSync skips a node if requesting versions from it fails with one of the following exception types:
- ConnectTimeoutException
- NoHttpResponseException
- SocketException
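As a rough illustration of the check described above, here is a minimal sketch, not the actual PeerSync code. The class and method names are hypothetical. To stay free of an HttpClient dependency, the two HttpClient exception types are matched by simple class name, which is an assumption made for this sketch only.

```java
import java.net.SocketException;
import java.util.Set;

public class SkippableExceptionSketch {

    // Matched by simple name only so the sketch compiles without HttpClient on the classpath.
    private static final Set<String> HTTPCLIENT_SKIPPABLE =
            Set.of("ConnectTimeoutException", "NoHttpResponseException");

    /** Returns true if the error, or any of its causes, is one of the skippable types. */
    public static boolean isSkippable(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            // instanceof also covers subclasses such as java.net.ConnectException.
            if (t instanceof SocketException) {
                return true;
            }
            if (HTTPCLIENT_SKIPPABLE.contains(t.getClass().getSimpleName())) {
                return true;
            }
        }
        return false;
    }
}
```

Note that java.net.SocketTimeoutException is not a subclass of SocketException, so this check does not treat it as skippable.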
Sometimes the other node hangs but still accepts connections. In that case a SocketTimeoutException is thrown, the PeerSync process is considered failed, and the whole shard stays leaderless for as long as the hung node is still around.
We can't just blindly add SocketTimeoutException to the list above, since Shalin mentioned that timeouts can also happen for genuine reasons, e.g. a temporary GC pause.
I think the general idea here is that we obey the leaderVoteWait restriction and retry the sync with the other nodes when a connection or timeout exception happens.
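One possible reading of this idea is sketched below: retry a timed-out version request until the leaderVoteWait budget is used up, and only then give up on that node. This is a sketch under assumptions, not a proposed patch. The class, the VersionRequest interface, the Outcome enum, and the retryPauseMs parameter are all hypothetical, and whether a node that keeps timing out should be skipped or should fail the sync is left open here.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeUnit;

public class PeerSyncRetrySketch {

    /** Hypothetical stand-in for sending one "get versions" request to a replica. */
    @FunctionalInterface
    public interface VersionRequest {
        void send() throws SocketTimeoutException;
    }

    /** Hypothetical result of trying one replica. */
    public enum Outcome { SYNCED, SKIPPED_UNRESPONSIVE }

    public static Outcome requestWithRetry(VersionRequest request,
                                           long leaderVoteWaitMs,
                                           long retryPauseMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(leaderVoteWaitMs);
        while (true) {
            try {
                request.send();
                return Outcome.SYNCED;
            } catch (SocketTimeoutException e) {
                // A short GC pause on the other node may clear up, so retry
                // until the leaderVoteWait budget is exhausted.
                if (System.nanoTime() - deadline >= 0) {
                    return Outcome.SKIPPED_UNRESPONSIVE;
                }
            }
            try {
                Thread.sleep(retryPauseMs);
            } catch (InterruptedException ie) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                return Outcome.SKIPPED_UNRESPONSIVE;
            }
        }
    }
}
```

With this shape, a node that recovers from a temporary pause is still synced with, while a node that hangs for the whole window no longer blocks the election indefinitely.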