On a stress cluster we saw the following sequence of events following a service restart while under load:
- a peer is elected leader successfully
- both of its followers have divergent logs
- when it connects to a new peer with a divergent log, it decides to fall back to index 0 rather than falling back to the proper committed index of that peer
- upon falling back to index 0, will never succeed since the first segment of the log was already GCed long ago.
Thus, the leader thinks that it needs to evict both of the followers and can't replicate to them, and the tablet gets "stuck".