Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
For Ratis, the number of replicas which must remain available when a node goes into maintenance is a simple integer, defaulting to 2, configured via hdds.scm.replication.maintenance.replica.minimum.
This means that for a Ratis container, one of the 3 nodes can be offline without any replication happening. It can be set to 1, letting two nodes go offline, or to 3, ensuring full redundancy and hence replication whenever any node is taken offline.
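The Ratis check described above can be sketched as follows. This is a hypothetical illustration, not the actual SCM code; the class and method names are invented.

```java
// Hypothetical sketch of the Ratis maintenance check described above;
// not the actual SCM implementation.
public class RatisMaintenanceCheck {

    // Default of hdds.scm.replication.maintenance.replica.minimum.
    static final int DEFAULT_MAINTENANCE_MINIMUM = 2;

    // Replication is triggered when fewer than `minimum` replicas
    // remain online while a node is in maintenance.
    static boolean replicationNeeded(int onlineReplicas, int minimum) {
        return onlineReplicas < minimum;
    }

    public static void main(String[] args) {
        // With the default of 2, one of the 3 replicas may be offline.
        System.out.println(replicationNeeded(2, DEFAULT_MAINTENANCE_MINIMUM)); // prints false
        // A second node going offline would trigger replication.
        System.out.println(replicationNeeded(1, DEFAULT_MAINTENANCE_MINIMUM)); // prints true
    }
}
```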
It could be argued that 1 would be a better default here. With the default placement of 2 replicas on one rack and 1 on another, that would allow a full rack to be taken offline without triggering replication.
For EC, it's a little more tricky. Setting aside Ratis ONE containers, which are rarely used in practice, EC can tolerate 2 replicas offline (for 3-2), 3 (for 6-3) or 4 (for 10-4).
If we use the same default of 2, replication would always be required for 3-2 containers. Also, the "number of replicas online" doesn't make as much sense for EC, as the replicas are not identical.
There is a further wrinkle with EC: when any of the data replicas are offline, online reconstruction must be used to read the data, causing a performance penalty, but that cannot be avoided.
With the Ratis default of 2, having two replicas out of 3 online leaves a remaining redundancy of 1, i.e. we can afford to lose one more copy and still read the data.
If we change the Ratis setting to 1, the remaining redundancy is 0, because the loss of another replica renders the data unreadable.
For EC, if we default the setting to a "remaining redundancy" of 1, this would mean we can tolerate the loss of 1 more replica and still read the data.
This would allow 3-2 to have 1 replica offline, 6-3 could have 2 and 10-4 could have 3, without any replication. In all cases the remaining redundancy is the same as Ratis with 2 replicas online.
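The remaining-redundancy rule above can be sketched as follows; this is a hypothetical helper illustrating the arithmetic, not the actual ReplicationManager code.

```java
// Sketch of the proposed "remaining redundancy" maintenance check for EC
// (hypothetical helper, not the actual SCM ReplicationManager code).
public class EcMaintenanceCheck {

    // Remaining redundancy: how many further replicas can be lost
    // while the data stays readable.
    static int remainingRedundancy(int dataReplicas, int onlineReplicas) {
        return onlineReplicas - dataReplicas;
    }

    // Replication is needed when the remaining redundancy would drop
    // below the configured minimum (proposed default: 1).
    static boolean replicationNeeded(int data, int parity, int offline, int required) {
        int online = data + parity - offline;
        return remainingRedundancy(data, online) < required;
    }

    public static void main(String[] args) {
        int required = 1; // proposed default
        System.out.println(replicationNeeded(3, 2, 1, required));  // prints false (3-2, 1 offline)
        System.out.println(replicationNeeded(3, 2, 2, required));  // prints true  (3-2, 2 offline)
        System.out.println(replicationNeeded(6, 3, 2, required));  // prints false (6-3, 2 offline)
        System.out.println(replicationNeeded(10, 4, 3, required)); // prints false (10-4, 3 offline)
    }
}
```

Note the check is expressed in terms of how many replicas are online versus the data count, rather than a raw replica minimum, which is why one setting can cover all EC schemes.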
Additionally, it's highly likely online recovery will be needed to read the data, e.g. if 1 replica is offline in 10-4 there is a 10 in 14 (5 in 7) chance it's a data replica, so trying to keep more replicas online for larger EC groups is probably not going to help performance much.
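The 10-in-14 figure is simply the ratio of data replicas to total replicas; a trivial sketch (hypothetical helper, invented name):

```java
// Chance that a single offline replica in an EC group is a data replica:
// data / (data + parity). Hypothetical helper for illustration only.
public class DataReplicaOdds {

    static double dataOdds(int data, int parity) {
        return (double) data / (data + parity);
    }

    public static void main(String[] args) {
        // 10-4: 10/14 = 5/7, roughly a 71% chance.
        System.out.println(dataOdds(10, 4));
    }
}
```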
In a large cluster, ideally EC containers will be spread across racks so that there is only 1 replica per rack; taking a full rack offline would then reduce the redundancy by only 1, meaning even 3-2 containers could tolerate a rack going into maintenance.
In summary, I believe the simplest solution is to have an EC setting hdds.scm.replication.maintenance.ec.remaining.redundancy = 1, which we use for maintenance of EC containers and which is basically equivalent to the Ratis default of 2. It may make sense to call the new parameter hdds.scm.replication.maintenance.remaining.redundancy and use the same value for both Ratis and EC, deprecating the old setting.
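If adopted, the proposed setting might look like the following Hadoop-style configuration fragment. This is a sketch only: the property does not exist yet, and the description text is an assumption based on the proposal above.

```xml
<property>
  <name>hdds.scm.replication.maintenance.remaining.redundancy</name>
  <value>1</value>
  <description>
    The minimum remaining redundancy (extra replicas beyond what is
    needed to read the data) that must be preserved while nodes are in
    maintenance before replication is triggered. Applies to both Ratis
    and EC containers under this proposal.
  </description>
</property>
```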
Issue Links
- causes
  - HDDS-7201 testContainerIsReplicatedWhenAllNodesGotoMaintenance is failing frequently (Resolved)
- links to