[KUDU-1317] Tablet re-replication is not well spread across a cluster - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.7.0
Fix Version/s: 0.7.0
Component/s: master
Labels: None
Target Version/s: 0.7.0
Code Review: http://gerrit.cloudera.org:8080/#/c/1654/

Description

This is an issue that decster noticed on his ~70 node cluster. When a server hosting many tablets goes down, each of those tablets has to create new replicas elsewhere. We would expect in a 70-node cluster that all other nodes would participate in recovery in order to minimize the recovery time. However, we found that only a small number of nodes acted as 'sources' for making new tablet replicas.

The issue is that the master currently assigns replicas in strict round-robin order. So, if the number of tablet servers in the cluster is a multiple of three, servers {A,B,C} end up hosting the same set of replicas, servers {D,E,F} host another set, and so on. If a server then fails, only two servers can act as re-replication sources. If the number of servers is not a multiple of three, the problem is not quite as bad, but the number of sources is still limited to four (the two "adjacent" servers on each side).
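
The effect can be reproduced with a small simulation. This is only a sketch under assumptions: it models "strict round-robin" as a cursor that advances by one server per replica, uses a replication factor of three, and the function names are made up for illustration. It is not the master's actual code, nor the contents of the attached placement.py.

```python
def round_robin_placement(num_servers, num_tablets, replication=3):
    """Place each tablet's replicas on consecutive servers, continuing
    from where the previous tablet stopped (assumed model of the bug)."""
    placements = []
    cursor = 0
    for _ in range(num_tablets):
        placements.append([(cursor + i) % num_servers for i in range(replication)])
        cursor = (cursor + replication) % num_servers
    return placements


def recovery_sources(placements, failed):
    """Return the servers that share at least one tablet with `failed`,
    i.e. the only servers that can act as re-replication sources."""
    sources = set()
    for replicas in placements:
        if failed in replicas:
            sources.update(r for r in replicas if r != failed)
    return sources


# 69 servers (a multiple of three): server 0 only ever shares tablets with 1 and 2.
print(len(recovery_sources(round_robin_placement(69, 1000), failed=0)))  # 2

# 70 servers (not a multiple of three): the two neighbours on each side.
print(len(recovery_sources(round_robin_placement(70, 1000), failed=0)))  # 4
```

In both cases the number of possible sources stays constant no matter how many tablets or servers the cluster has.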

The master should spread out the replicas more randomly so that when a server goes down, a large number of other servers can act as sources for re-replication.
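
For comparison, here is a minimal sketch of what a more random spread buys. It assumes uniform random selection of three distinct servers per tablet, which is just one simple way to randomize placement and not necessarily the approach taken in the fix; the function names and the fixed seed are illustrative only.

```python
import random


def random_placement(num_servers, num_tablets, replication=3, seed=0):
    """Pick `replication` distinct servers uniformly at random for each tablet."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [rng.sample(range(num_servers), replication) for _ in range(num_tablets)]


def recovery_sources(placements, failed):
    """Return the servers that share at least one tablet with `failed`."""
    sources = set()
    for replicas in placements:
        if failed in replicas:
            sources.update(r for r in replicas if r != failed)
    return sources


placements = random_placement(num_servers=70, num_tablets=10000)
# With this many tablets, most of the other 69 servers should hold a
# replica in common with server 0 and can therefore help with recovery.
print(len(recovery_sources(placements, failed=0)))
```

Pure uniform selection ignores how loaded each server already is, so a real placement policy would likely also need to balance replica counts across servers.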

Attachments

placement.py
26/Jan/16 21:48
1.0 kB
Todd Lipcon

People

Assignee: Todd Lipcon

Reporter: Todd Lipcon

Votes: 0

Watchers: 1

Dates

Created: 26/Jan/16 21:46

Updated: 29/Jan/16 23:52

Resolved: 29/Jan/16 23:52