• Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.3.0
    • SCM


      For Ratis, the number of replicas which must be available when a node goes into maintenance is a simple integer defaulting to 2 in hdds.scm.replication.maintenance.replica.minimum.

      This means that for a Ratis container, one out of the 3 nodes can be offline without any replication happening. This can be set to 1, letting two go offline or 3 ensuring full redundancy and hence replication when any node is taken offline.

      It could be argued that 1 would be a better default here. With the default placement of 2 replicas on one rack and 1 on another rack, that should allow for a full rack to be taken offline without replication.

      For EC, its a little more tricky. Aside from Ratis 1 containers, which are rarely used in practice, EC can tolerate 2 offline (for 3-2), 3 (for 6-3) or 4 (for 10-4).

      If we use the same default of 2, that means replication will always be required for 3-2 containers. Also the "number of replicas online" doesn't make as much sense for EC, as each replica is not identical.

      EC is also slightly more tricky - when any of the data copies are offline, online reconctruction must be used to read the data, causing a performance penalty, but that cannot be avoided.

      If we take the Ratis default of 2 - when there are two replicas out of 3 online, then we have a remaining redundancy of 1 - ie we can afford to lose one more copy and still read data.

      If we change the Ratis setting to 1, there is a remaining redundancy of 0, because the loss of another replica renders the data unreadable.

      For EC, if we default the setting to a "remaining redundancy" of 1, this would mean we can tolerate a loss of 1 more replicas and still read the data.

      This would allow for 3-2 to have 1 replica offline, 6-3 could have 2 and 10-4 could have 3 without any replicaion. In all cases the data redundancy is the same as with Ratis having 2 containers offline.

      Additionally, its highly likely online recovery will be needed to read the data, eg if 1 container is offline in 10-4 there is a 10 in 14 (5 in 7) chance its a data container, so trying to keep more containers online for larger EC groups is probably not going to help performance much.

      In a large cluster, ideally EC containers will be spread across racks such that there is only 1 replia per rack, so taking a full rack offline would only reduce the redundancy by 1 meaning even 3-2 containers could tolerate a rack going into maintenance.

      In summary, I believe the simplest solution, is to have an EC setting hdds.scm.replication.maintenance.ec.remaining.redundancy = 1 which we use for maintenance of EC containers and is basically equivalent to the Ratis default of 2. It may make sense to call the new parameter hdds.scm.replication.maintenance.remaining.redundancy and use the same value for both Ratis and EC, deprecating the old value.


        Issue Links



              sodonnell Stephen O'Donnell
              sodonnell Stephen O'Donnell
              0 Vote for this issue
              1 Start watching this issue