Dan found this while working on Kudu training material.
Suppose you have a three-node cluster and a table with a single tablet (replicated three times). Now suppose you stop one tserver, delete all of its on-disk data, then restart it.
You would expect the following:
- The tablet's leader replica can no longer reach the replica on the reformatted tserver.
- The leader will evict that replica.
- The master will notice the tablet's under-replication and ask the leader to add a new replica, probably on the reformatted node.
Instead, there's no eviction at all. The leader replica keeps spewing messages like this in its log:
W0913 14:13:18.411238 22597 consensus_peers.cc:332] T 89dfba0c0a714259acf69d9f611e1e92 P 1540ac6e6cb44c2c9f9c6c6c98fd61f7 -> Peer cc2ef23f1c2c42b7a6a02d7183d92884 (dan-test-g-2.gce.cloudera.com:7050): Couldn't send request to peer cc2ef23f1c2c42b7a6a02d7183d92884 for tablet 89dfba0c0a714259acf69d9f611e1e92. Error code: WRONG_SERVER_UUID (16). Status: Invalid argument: UpdateConsensus: Wrong destination UUID requested. Local UUID: ef3ea81d59fc4a91b754cfe63b21e6ee. Requested UUID: cc2ef23f1c2c42b7a6a02d7183d92884. Retrying in the next heartbeat period. Already tried 5821 times.
Having looked at the code responsible for starting replica eviction (PeerMessageQueue::RequestForPeer) and the code spewing that error (Peer::ProcessResponseError), I think I see what's going on. The eviction code in RequestForPeer() checks the peer's "last successful communication time" to decide whether to evict or not. Intuitively you'd expect that time to be updated only when the peer responds successfully, but there are a couple of cases in Peer::ProcessResponseError where we update the last communication time anyway. Notably:
- If the RPC controller yielded a RemoteError, or
- If the RPC controller had no error but the response itself contained an error, and the error's code was not TABLET_NOT_FOUND, or
- If the RPC controller and the response had no error, but the response's status had an error, and that error's code was CANNOT_PREPARE.
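I think we're hitting the second case: there should be no RPC controller error (the reformatted tserver did respond to the leader replica), but the response does contain a WRONG_SERVER_UUID error. Since that code is not TABLET_NOT_FOUND, every failed heartbeat refreshes the last communication time, so the leader never considers the peer unreachable for long enough to evict it.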
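
To make the suspected interaction concrete, here is a minimal, self-contained C++ sketch of the three cases listed above and of a timeout-based eviction check. It is not the Kudu code: every type and function name below (PeerResponse, ErrorRefreshesLastCommunicationTime, PeerTracker, ShouldEvict) is hypothetical, times are plain integer seconds, and it assumes that any error not covered by the three cases leaves the last communication time untouched.

```cpp
// Hypothetical, simplified stand-ins for the real Kudu types.
enum class ControllerError { kNone, kRemoteError, kOther };
enum class ResponseErrorCode { kNone, kTabletNotFound, kWrongServerUuid, kOther };
enum class StatusErrorCode { kNone, kCannotPrepare, kOther };

struct PeerResponse {
  ControllerError controller_error = ControllerError::kNone;
  ResponseErrorCode response_error = ResponseErrorCode::kNone;
  StatusErrorCode status_error = StatusErrorCode::kNone;
};

// Returns true if an error response still counts as "communication" with the
// peer, following the three cases described in this report.
// Assumption: errors outside those cases do not refresh the timestamp.
bool ErrorRefreshesLastCommunicationTime(const PeerResponse& r) {
  if (r.controller_error != ControllerError::kNone) {
    // Case 1: the RPC controller yielded a RemoteError.
    return r.controller_error == ControllerError::kRemoteError;
  }
  if (r.response_error != ResponseErrorCode::kNone) {
    // Case 2: the response carries an error other than TABLET_NOT_FOUND.
    return r.response_error != ResponseErrorCode::kTabletNotFound;
  }
  // Case 3: the response's status carries CANNOT_PREPARE.
  return r.status_error == StatusErrorCode::kCannotPrepare;
}

// Hypothetical model of the leader's per-peer bookkeeping.
struct PeerTracker {
  int last_communication_sec = 0;

  void OnErrorResponse(const PeerResponse& r, int now_sec) {
    if (ErrorRefreshesLastCommunicationTime(r)) {
      last_communication_sec = now_sec;
    }
  }

  // Evict only if the peer has been silent for longer than the timeout.
  bool ShouldEvict(int now_sec, int timeout_sec) const {
    return now_sec - last_communication_sec > timeout_sec;
  }
};
```

In this model, feeding a PeerTracker a WRONG_SERVER_UUID response on every heartbeat keeps last_communication_sec current, so ShouldEvict() stays false no matter how long the peer has been misbehaving. Feeding it TABLET_NOT_FOUND responses instead lets the timeout elapse and eviction proceed.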