Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
0.7.0
None
Description
This is an issue that decster noticed on his ~70 node cluster. When a server hosting many tablets goes down, each of those tablets has to create new replicas elsewhere. We would expect in a 70-node cluster that all other nodes would participate in recovery in order to minimize the recovery time. However, we found that only a small number of nodes acted as 'sources' for making new tablet replicas.
The issue is that the master currently assigns replicas in strict round-robin order. So, if the number of tablet servers (TS) in the cluster is a multiple of three, we end up with servers {A,B,C} having the same set of replicas, servers {D,E,F} having another set, etc. If a server fails, only two servers can act as re-replication sources. If the number of servers is not a multiple of three, the problem is not quite as bad, but the sources are still limited to four (the two "adjacent" servers on either side).
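To make the arithmetic concrete, here is a minimal simulation sketch. It assumes a simplified model of strict round-robin (each tablet takes the next three servers in order, and the following tablet starts where the previous one left off); the function names are hypothetical and this is not the master's actual code.

```python
def round_robin_placement(num_servers, num_tablets, replication=3):
    """Assumed model: each tablet takes the next `replication` servers in order."""
    placements = []
    next_server = 0
    for _ in range(num_tablets):
        replicas = [(next_server + i) % num_servers for i in range(replication)]
        placements.append(replicas)
        # The next tablet starts right after the last server used.
        next_server = (next_server + replication) % num_servers
    return placements


def recovery_sources(placements, failed):
    """Servers that share at least one tablet with the failed server."""
    sources = set()
    for replicas in placements:
        if failed in replicas:
            sources.update(r for r in replicas if r != failed)
    return sources


# 69 servers (a multiple of three): fixed groups, so only two sources.
print(sorted(recovery_sources(round_robin_placement(69, 6900), failed=0)))
# 70 servers (not a multiple of three): the two neighbors on each side.
print(sorted(recovery_sources(round_robin_placement(70, 7000), failed=10)))
```

Under this model, the number of possible sources does not grow with cluster size, which matches the behavior described above.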
The master should spread out the replicas more randomly so that when a server goes down, a large number of other servers can act as sources for re-replication.
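As a rough illustration of the proposed direction, the sketch below picks each tablet's replicas by uniform random sampling and counts how many servers could act as sources after a failure. Uniform sampling is only an assumption chosen for simplicity, and the function names are hypothetical; the actual fix may use a different selection policy.

```python
import random


def random_placement(num_servers, num_tablets, replication=3, seed=0):
    """Assumed policy: pick `replication` distinct servers uniformly at random."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.sample(range(num_servers), replication)
            for _ in range(num_tablets)]


def recovery_sources(placements, failed):
    """Servers that share at least one tablet with the failed server."""
    sources = set()
    for replicas in placements:
        if failed in replicas:
            sources.update(r for r in replicas if r != failed)
    return sources


placements = random_placement(num_servers=70, num_tablets=7000)
sources = recovery_sources(placements, failed=0)
print(f"{len(sources)} of 69 surviving servers can act as sources")
```

With many tablets per server, nearly every surviving server ends up sharing at least one tablet with the failed one, so recovery work can be spread across the cluster.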