Dan found this while working on Kudu training material.
Suppose you have a three-node cluster and a table with a singleton tablet (replicated three times). Now suppose you stopped one tserver, deleted all of its on-disk data, then restarted it.
You would expect the following:
- The tablet's leader replica can no longer reach the replica on the reformatted tserver.
- The leader will evict that replica.
- The master will notice the tablet's under-replication and ask the leader to add a new replica, probably on the reformatted node.
Instead, there's no eviction at all. The leader replica keeps spewing messages like this in its log:
Having looked at the code responsible for starting replica eviction (PeerMessageQueue::RequestForPeer) and the code spewing that error (Peer::ProcessResponseError), I think I see what's going on. The eviction code in RequestForPeer() checks the peer's "last successful communication time" to decide whether to evict. Intuitively, you'd expect that time to be updated only when the peer responds successfully, but there are a few cases in Peer::ProcessResponseError where we update the last communication time even though the exchange failed. Notably:
- If the RPC controller yielded a RemoteError, or
- If the RPC controller had no error but the response itself contained an error, and the error's code was not TABLET_NOT_FOUND, or
- If the RPC controller and the response had no error, but the response's status had an error, and that error's code was CANNOT_PREPARE.
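The three cases above can be sketched as a single predicate. This is a simplified, hypothetical model, not the real Kudu code: the enum names and the function RefreshesLastCommunicationTime are made up for illustration, and the actual logic lives in Peer::ProcessResponseError with Kudu's own error types.

```cpp
#include <cassert>

// Hypothetical, simplified error codes standing in for the real
// Kudu RPC controller, response, and consensus status errors.
enum class ControllerError { kNone, kRemoteError, kNetworkError };
enum class ResponseError { kNone, kTabletNotFound, kWrongServerUuid };
enum class StatusError { kNone, kCannotPrepare, kOther };

// Sketch of the decision described above: does this failed exchange
// still refresh the peer's "last successful communication time"?
bool RefreshesLastCommunicationTime(ControllerError ctrl,
                                    ResponseError resp,
                                    StatusError status) {
  // Case #1: the RPC controller yielded a RemoteError.
  if (ctrl == ControllerError::kRemoteError) return true;
  if (ctrl != ControllerError::kNone) return false;
  // Case #2: no controller error, but the response contains an error
  // whose code is not TABLET_NOT_FOUND.
  if (resp != ResponseError::kNone) {
    return resp != ResponseError::kTabletNotFound;
  }
  // Case #3: controller and response are clean, but the response's
  // status has an error with code CANNOT_PREPARE.
  return status == StatusError::kCannotPrepare;
}
```

Under this model, a WRONG_SERVER_UUID response with a clean RPC controller falls into case #2 and keeps refreshing the timestamp.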
I think we're hitting case #2, because there should be no RPC controller error (the reformatted tserver did respond to the leader replica), but the response does contain a WRONG_SERVER_UUID error.
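To see why that prevents eviction, here is a hedged sketch of the timeout check in RequestForPeer. The function name ShouldEvict and the use of std::chrono are assumptions for illustration; the real code uses Kudu's MonoTime and a configurable follower-timeout flag.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical sketch of the eviction check in
// PeerMessageQueue::RequestForPeer: a peer is only considered for
// eviction once it has gone silent for longer than the timeout.
bool ShouldEvict(Clock::time_point last_successful_comm,
                 Clock::time_point now,
                 std::chrono::seconds timeout) {
  return now - last_successful_comm > timeout;
}
```

Because each WRONG_SERVER_UUID response refreshes last_successful_comm, the elapsed time never exceeds the timeout and ShouldEvict never fires, matching the observed behavior.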