Hadoop HDFS / HDFS-1094

Intelligent block placement policy to decrease probability of block loss

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: namenode
    • Labels:
      None

      Description

      The current HDFS implementation specifies that the first replica is local and the other two replicas are on any two random nodes on a random remote rack. This means that if any three datanodes die together, there is a non-trivial probability of losing at least one block in the cluster. This JIRA is to discuss whether there is a better algorithm that can lower the probability of losing a block.

      Attachments

      1. calculate_probs.py
        2 kB
        Rodrigo Schmidt
      2. failure_rate.py
        11 kB
        Rodrigo Schmidt
      3. prob.pdf
        67 kB
        Aravind Menon
      4. prob.pdf
        67 kB
        Aravind Menon

        Issue Links

          Activity

          dhruba dhruba borthakur added a comment -

           One simple solution comes to mind. Let's arrange all the racks in a cluster and number them in a logical fashion: 0, 1, 2, ..., n.

           1. The first replica is on the local node on rack r. The other two replicas are placed on randomly selected nodes on either rack r-1 or rack r+1. In this approach, three datanodes in two consecutive racks have to fail simultaneously for a block to be lost. This is better than the current implementation, where three datanode failures anywhere in the cluster can cause a block to be lost.

           2. The first replica is on the local node on rack r. Let's say that the local node is the 6th node in its rack. Then the other two replicas of this block will also reside on the 6th node of two randomly selected remote racks. In this approach, three datanodes occupying the same position p in their respective racks have to fail for a block to be lost.
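
           For concreteness, a minimal Python sketch of the two selection rules above (hypothetical helper code; it assumes racks are numbered 0..n-1, that nodes_by_rack maps each rack number to its list of nodes, and it ignores capacity, load, and rack-size edge cases):

           import random

           def choose_targets_scheme1(local_node, local_rack, nodes_by_rack, num_racks):
               # Scheme 1: first replica local, other two on random nodes in rack r-1 or r+1.
               neighbor = random.choice([(local_rack - 1) % num_racks,
                                         (local_rack + 1) % num_racks])
               candidates = [n for n in nodes_by_rack[neighbor] if n != local_node]
               return [local_node] + random.sample(candidates, 2)

           def choose_targets_scheme2(local_rack, position, nodes_by_rack, num_racks):
               # Scheme 2: all three replicas occupy the same position p in their racks.
               remote_racks = random.sample(
                   [r for r in range(num_racks) if r != local_rack], 2)
               return [nodes_by_rack[local_rack][position]] + \
                      [nodes_by_rack[r][position] for r in remote_racks]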

          acmurthy Arun C Murthy added a comment -

           Dhruba, doing more than one off-switch write might have a significant negative performance side-effect on the write pipeline.

          Shouldn't HDFS-385 be sufficient to address the current jira via a plug-in for a 3-rack block placement? Or is this jira meant to be about a plug-in?

          dhruba dhruba borthakur added a comment -

           This is more of a discussion on how to design a non-default policy; I do not plan to change the default policy of having only one off-switch write.

          Also, option 1 has only one off-switch write.

          bockelman Brian Bockelman added a comment -

          Hey Dhruba,

           How does this change the probability? Are you saying that P(rack r and rack r+1 fail) < P(rack i and rack j fail, i != j)? Can you sketch out mathematically how this decreases the probability of failure?

          Brian

          karthik.ranga Karthik Ranganathan added a comment -

           The issue I see with option 1 (which may not be a big deal) is that if more racks are added, the placement policy cannot utilize the new, emptier machines efficiently without re-balancing data. Would a hybrid help? Replica 1 is chosen at random from rack (r-1), and the next replica is chosen using scheme 2 (excluding the node on rack r-1).

          @Brian:
           I think this change protects against node failures more than rack failures. Let's say the replication factor is three. If you calculate the probability of data loss when any 3 machines in the cluster fail at the same time, then the placement matters:

          1) If the blocks are scattered across all the machines, then any 3 machines failing will contribute towards data loss.
           2) If a given block (by virtue of the new replication scheme) is placed only on a subset of machines, then the probability of data loss is reduced from any 3 machines failing to 3 machines in that particular subset failing. Failures of machines spread across distinct subsets don't contribute to data loss.

          bockelman Brian Bockelman added a comment -

          Hey Karthik,

          Let me play dumb (it might not be playing after all) and try to work out the math a bit.

          First, let's assume that on any given day, a node has 1/1000 chance of failing.

          CURRENT SCHEME: A block is on 3 random nodes. Probability of loss is a simultaneous failure of nodes X, Y, Z. Let's assume these are independent. P(X and Y and Z) = P(X) P(Y) P(Z) = 1 in a billion.

          PROPOSED SCHEME: Well, the probability is the same.

          So, given a specific block, we don't change the probability it is lost.

          What you seem to be calculating is the probability that three nodes go down out of N nodes:

          P(nodes X, Y, and Z fail for any three distinct X, Y, Z) = 1 - P(N-3 nodes stay up) = 1 - [999/1000]^[N-3]

          Sure enough, if you use a small subset (N=40 maybe), then the probability of 3 nodes failing is smaller for small subsets than the whole cluster.

           However, that's not the number you want! You want the probability that any block is lost when three nodes go down. That is, P(nodes X, Y, and Z fail, for some three distinct X, Y, Z that have at least one block in common) (call this P_1). Assuming that overlapping blocks, node death, and subset of nodes are all independent, you get:

          P_1 = P(three nodes having at least one common block) * P(3 node death) * (# of distinct 3-node subsets)

          The first number is decreasing with N, the second is constant with N, the third is increasing with N. The third is a well-known formula, while I don't have a good formula for the first value. Unless you can calculate or estimate the first, I don't think you can really say anything about decreasing the value of P_1.

           I think we are incorrectly treating the probability of data loss as proportional to the probability of 3 machines in a subset being lost, without taking into account the probability of common blocks. The probabilities get tricky, hence me asking for someone to sketch it out mathematically...
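
           As a rough illustration of why that first factor is hard to pin down in closed form, here is a toy Monte Carlo sketch (my own, with small made-up parameters, not the real cluster sizes) that estimates P(three failed nodes hold all replicas of at least one common block) under uniform random placement:

           import random

           def estimate_p_common_block(num_nodes=100, num_blocks=10000, replicas=3, trials=1000):
               # Place every block on `replicas` nodes chosen uniformly at random.
               placements = [frozenset(random.sample(range(num_nodes), replicas))
                             for _ in range(num_blocks)]
               hits = 0
               for _ in range(trials):
                   failed = set(random.sample(range(num_nodes), replicas))
                   # Data is lost if some block has all its replicas among the failed nodes.
                   if any(p <= failed for p in placements):
                       hits += 1
               return hits / trials

           print(estimate_p_common_block())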

          rschmidt Rodrigo Schmidt added a comment -

          I think Brian is right. Dhruba's proposal doesn't change the probability of data loss given that F machines fail.

           Where you place blocks will only impact the failure probability of that particular block, and only if individual nodes have different failure probabilities. Placing blocks in a circular arrangement (Dhruba's proposal) or randomly doesn't affect the expected number of missing blocks given F failures, because they are both uniform placement strategies. And if we go for a non-uniform placement strategy, we will only make the worst case worse. What I mean is that the current block placement strategy is already optimal, and we can't make the expected number of missing blocks better by changing just block placement. We can only improve it by changing the redundancy scheme that we use (e.g., adding error-correcting codes or more replicas).

          If you are wondering what is the expected number of blocks that go missing given F failures, here is the math:

           Consider an r-way replication scheme and the following variables:

          r = number of replicas used
          F = number of machines that fail concurrently
          N = number of machines
          B = number of blocks (without counting replication)

          I use the prefix notation C(N,a) to denote the number of ways you can pick a elements out of N.

          With N machines, we have C(N,r) possible replica placements for a single block. Given that F machines fail, we have C(F,r) possible placements within those F failed machines.

          If the block placement strategy is uniform (i.e., it doesn't favor some specific nodes or racks more than others), the probability of a block having all its replicas within the failed machines is

          C(F,r) / C(N,r) =

          (F) * (F-1) * (F-2) * ... * (F - r + 1)
          -----------------------------------------------
          (N) * (N-1) * (N-2) * ... * (N - r + 1)

          The expected number of missing blocks when F machines fail should be

          B * [formula above]

          If B=30M, N = 1000 and F = 3, r=3, the expected number of missing blocks would be about 0.18

          Please someone correct if my math or rationale is wrong.
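
           For convenience, the expected value above can be evaluated with a few lines of Python (a sketch of the formula only, using the example numbers; not any of the attached scripts):

           from math import comb

           def expected_missing_blocks(B, N, F, r):
               # E[missing blocks] = B * C(F, r) / C(N, r) under uniform placement
               return B * comb(F, r) / comb(N, r)

           print(expected_missing_blocks(B=30_000_000, N=1000, F=3, r=3))  # ~0.18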

          bockelman Brian Bockelman added a comment -

          A funny related anecdote that I've heard third-hand. I could never trace down the authenticity, but I found it amusing -

           A large physics experiment once tried to track down and classify as many errors in their simulation software as possible. After they removed the known sources of errors, they took the remaining unreproducible errors and mapped them to the worker nodes. Then they took the list of worker nodes and mapped it to where they were in the machine room. Sure enough, all the unreproducible errors could be traced to the top two nodes in the rack.

          So, if you put all the copies at the same height on the rack, the probability of losing the files at the top of the rack is definitely higher than the probability of losing the bottom of the rack.

          karthik.ranga Karthik Ranganathan added a comment -

          I think there is a slight change in the way probability should be calculated if the block placement policy enforces that certain blocks reside on a subset of machines. Nevertheless, I went with the probability of losing data as opposed to the expected number of block losses.

           Scheme 1 - pick any machine and put blocks there. Further, assume that F = r in your example.

          P(of losing data given r failures)
          = P(of losing at least 1 block)
          = 1 - P(of not losing any block)
          = 1 - (P(of not losing a specific block) ^ B)
          = 1 - ((1 - 1/C(N,r)) ^ B)

          Scheme 2 - assume that you have a fixed pool of machines that you replicate blocks to. For simplicity, I am going to assume what this means is that there are K machines that contain a set of blocks and all their replicas. So there are (N/K) such sets of machines. Further, assuming an even distribution, there are only B/(N/K) blocks in this set of K machines.

          P(of losing data given r failures)
          = P(r failures being in one set of K machines) * P(of losing at least 1 block in that set)

          P(r failures being in one set of K machines) = C(N/K,1)*C(K,r)/C(N,r)
          P(of losing at least 1 block in that set) = 1 - ((1 - 1/C(K,r)) ^ (B/(N/K))) --> this follows from the fact that there are K nodes and B/(N/K) blocks.

          Plugging in B=30M, N = 1000 and F = 3, r=3, K=60 (replicate all blocks in the previous and next rack, 20 machines per rack):

           Scheme 1 : P(data loss) = 1 - ((1 - 1/C(1000,3)) ^ 30M) = 0.165
          Scheme 2 : P(data loss) = P(r failures being in one set of K machines)*P(of losing at least 1 block in that set) = 0.0034 * 1 = 0.0034

          Am I doing something wrong?
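
           The same two calculations can be sketched in a few lines of Python (using the numbers above; note that N/K = 1000/60 is not an integer, exactly as in the hand calculation):

           from math import comb

           B, N, F, r, K = 30_000_000, 1000, 3, 3, 60

           # Scheme 1: each block's replicas land on any r of the N nodes
           p_loss_scheme1 = 1 - (1 - 1 / comb(N, r)) ** B

           # Scheme 2: the cluster is split into N/K groups of K nodes;
           # each group holds B/(N/K) blocks together with all their replicas
           groups = N / K
           p_failures_in_one_group = groups * comb(K, r) / comb(N, r)
           p_block_lost_in_group = 1 - (1 - 1 / comb(K, r)) ** (B / groups)
           p_loss_scheme2 = p_failures_in_one_group * p_block_lost_in_group

           print(p_loss_scheme1)  # ~0.165
           print(p_loss_scheme2)  # ~0.0034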

          karthik.ranga Karthik Ranganathan added a comment -

           Just synced up with some folks (including Rodrigo); I think the mismatch comes from the following:

          The expected number of blocks lost is the same in both cases. This is because when 3 nodes die:

          In scheme 1, there is a high probability data is lost, and a little of it is lost when it happens.
          In scheme 2, there is a low probability data is lost, but more of it is lost when it happens.

          The product of the probability and the number of blocks lost is the same. That gives us 2 choices - reduce the probability or the number of blocks lost.

           The underlying issue is that any data loss is bad, be it a little or a lot (especially for some kinds of applications). So we are better off lowering the probability that any data is lost rather than the number of blocks lost at a time. From that perspective, this is a good change.

          rschmidt Rodrigo Schmidt added a comment -

          Yes, if we want to minimize the chances of having ANY data loss, restricting block replication to a subset of the machines is more effective.

          bockelman Brian Bockelman added a comment -

          Hey Karthik,

          Why don't we take this to the extreme and set K=r? You can do this without introducing a second off-rack copy in many cases.

          Then, the probability of any data loss is fixed at 1/C(N, r); for your sample case, this is 6.01e-8, quite small.

          Brian

          karthik.ranga Karthik Ranganathan added a comment -

          Hey Brian,

          Yes, that would help with the data loss case but has some unfortunate side effects:

          1. Data availability would be reduced - cannot access the data at all on inter-rack switch failure.
          2. It would cause the data to be very unevenly distributed across the various racks.

           Another workaround for this is to have many smaller clusters instead of one large cluster, but that becomes a maintenance nightmare (operationally), and the application layer above would have to balance the data across these clusters.

          shv Konstantin Shvachko added a comment -

          > P(of not losing a specific block) = 1 - 1/C(N,r)

          Is that right?
           I think you are throwing one replica at a time onto the cluster. The probability of the first replica missing the failed nodes is (N-r)/N. The probability of the second falling onto a live node, excluding the one that already has the first replica, is (N-r-1)/N. And so on.

          It would be good if somebody could combine calculations in a pdf (or html) doc and attach to this jira. We can link it from FAQ then.

          jnp Jitendra Nath Pandey added a comment -

          > I think you are throwing one replica at a time on the cluster. The probability of first missing the failed nodes is (N-r)/N. The probability of the second falling into live node excluding the one that already has the first replica is (N-r-1)/N.

           Shouldn't it be (N-r)/N, (N-r-1)/(N-1), (N-r-2)/(N-2), and so on?
           Similarly, the probability that all replicas reside on the failed nodes would be
           r/N * (r-1)/(N-1) * ... * 1/(N-r+1) = 1/C(N,r), which is the same as in Karthik's formula.
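
           A quick numeric check of that identity (assuming N = 1000 and r = 3):

           from math import comb
           from functools import reduce

           N, r = 1000, 3
           # r/N * (r-1)/(N-1) * ... * 1/(N-r+1)
           product = reduce(lambda acc, i: acc * (r - i) / (N - i), range(r), 1.0)
           print(product, 1 / comb(N, r))  # both ~6.02e-09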

          shv Konstantin Shvachko added a comment -

           Yes, I forgot to decrement the total number of nodes after the first replica is placed. It should be (N-r-1)/(N-1) instead of (N-r-1)/N, which leads to Karthik's formula. I stand corrected.

          aravind.menon Aravind Menon added a comment -

          Hi,

          We did some analysis of the data loss probability in HDFS under different block placement schemes and replication factors. We consider two simple placement schemes: 1. random placement, where each data block can be placed at random on any machine, and 2. in-rack placement, where all replicas of a block are placed in the same rack. The detailed analysis of these scenarios is covered in the attached pdf.

          The main observations from the analysis are:

          1. Probability of data loss increases with cluster size
          2. Probability of data loss is lower with in-rack placement than with random placement
          3. Probability of data loss is lower with higher degree of replication

          Regards,
          Aravind

          aravind.menon Aravind Menon added a comment -

          Fixed a minor typo in the pdf.

          Aravind

          dhruba dhruba borthakur added a comment -

           From all the above mathematical formulae, can we then say that the policy listed below would be a good first step:

          "The first replica is on a node on rack r. Then the other two replicas be randomly selected nodes on either rack r-1 or r+1. In this approach, three datanodes in two consecutive racks have to fail simultaneously for a block loss to occur. This is better than the current implementation where any three datanode failures in the entire cluster can cause one block to be lost."

          The NetworkTopology class in the namenode already has information about node to rack location.

          rschmidt Rodrigo Schmidt added a comment -

          For this JIRA I'm implementing a general idea, in which configuration variables determine a subset of the racks that can be used for each block and a subset of machines within that rack.

          However, the NetworkTopology class from Hadoop Common lacks methods to list racks or limit machines within a rack. I think this is fundamental for this JIRA or any other complex block placement policy (NetworkTopology has a very limited API).

          Even though it's possible to circumvent the NetworkTopology limitations by accessing the namenode directly (through FSNameSystem passed as FSClusterStats to BlockPlacementPolicy), this makes the code really ugly.

          I would like to get the general opinion about extending NetworkTopology from hadoop common.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          Quoted from prob.pdf:

          We consider the probability of data loss in a cluster of n machines, containing b data blocks replicated
          randomly with a replication factor of r. Let us call this probability p_loss random(n, b, r).

           It seems that the probability model described in prob.pdf is incorrect. Data loss probability P depends on time T: the longer the cluster runs, the higher the probability of data loss. The authors seem to be assuming that P is independent of T.

          rschmidt Rodrigo Schmidt added a comment -

           There are a number of simplifications there, including in the block placement algorithms. This was done to simplify the math.

          We now have better models for the algorithms, which we should post here soon. However, time-dependent failure probabilities would make calculations overly complicated. It would be great if someone could work that out.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          > There is a number of simplifications there, ...

          I see. You may want to include a section in the beginning to list out all the simplifications.

          rschmidt Rodrigo Schmidt added a comment -

          That might make it hard to start reading, since no context will be given to the reader.

          I think Aravind tried to list the whole set of assumptions in the beginning of section 2. Maybe we just have to make it clearer.

          knoguchi Koji Noguchi added a comment -

          Just curious. Are you getting missing/corrupt blocks for those with replication of 3 on your clusters?

          dhruba dhruba borthakur added a comment -

           @Koji: we have files with a replication factor of 3. If a large number of datanodes fail at the same time, we do see missing blocks. Sometimes the datanode process on these machines fails to start even after repeated start-dfs.sh attempts; sometimes the entire machine fails to reboot. Then we have to manually fix a few of those bad datanode machines and bring them back online; this fixes the "missing blocks" problem, but it is a manual process and is painful.

          hairong Hairong Kuang added a comment -

          This policy seems to decrease the aggregate read bandwidth of a file.

          rschmidt Rodrigo Schmidt added a comment -

          @Hairong: I'm not sure if I understand your point. Can you explain it a little more?

          hairong Hairong Kuang added a comment -

           Suppose a file has N blocks. The proposed policy places all N blocks in at most 3 racks, while the current default policy could place them in up to N+1 racks, which represents a bigger aggregate read bandwidth when N > 2.

          dhruba dhruba borthakur added a comment -

           Hi Hairong; does your comment about aggregate read bandwidth refer to the use case where there are multiple simultaneous readers of the same file?

          rschmidt Rodrigo Schmidt added a comment -

          @Hairong: Just a few observations

          1) The policy I'm implementing doesn't fix the number of racks to 3 and I don't think anyone agrees that should be the case. Dhruba mentioned this static 3-rack policy in this discussion, but I guess that was just an example of how block placement can reduce the probability of failure.

           2) There is a clear trade-off here. The more you restrict where you can place blocks, the worse your maximum bandwidth and load balance will tend to be (because you will have fewer ways to rearrange blocks to maximize these metrics). So, no free lunch.

           3) Still, in some cases, we might want to trade bandwidth and load-balancing possibilities for a lower probability of data loss, especially if HDFS is being used to store important data, backed up by some middleware that has a cache layer.

           Having said that, the patch I'm working on is a configurable policy where configuration variables define windows in two spaces, racks and machines within a rack, in which blocks can be placed. Potentially, these windows can be set to be the entire cluster, which is exactly the default block placement policy. At the other extreme, such windows can be defined to be as restrictive as only three machines, which gives the lowest probability of data loss but no freedom to improve load balance or bandwidth.

          Please let me know if this is not clear enough.
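
           For illustration only, a minimal Python sketch of the windowed idea as described above (hypothetical helper names, not the actual patch; it assumes racks maps each rack number to an ordered list of nodes and that machine_window >= 2):

           import random

           def choose_targets_windowed(writer_node, writer_rack, racks, rack_window, machine_window):
               # Pick the remote rack from a window of racks relative to the writer's rack,
               # and the two remote nodes from a window of positions within that rack.
               # Windows equal to the whole cluster reproduce the default policy;
               # tiny windows minimize the number of possible replica groups.
               num_racks = len(racks)
               candidate_racks = [(writer_rack + d) % num_racks for d in range(1, rack_window + 1)]
               remote_rack = random.choice(candidate_racks)
               pos = racks[writer_rack].index(writer_node)
               rack_nodes = racks[remote_rack]
               window = [rack_nodes[(pos + d) % len(rack_nodes)] for d in range(machine_window)]
               return [writer_node] + random.sample(window, 2)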

          hairong Hairong Kuang added a comment -

           Right, in the case of a map/reduce application, multiple mappers read the same file simultaneously. Placing the blocks of a file in multiple racks greatly increases the probability of reading from a rack-local node.

          hairong Hairong Kuang added a comment -

          Rodrigo, thanks for your explanation. Now I understand your proposal much better.

          > We assume machine failures are independent.

           I am not sure I agree with this assumption. From the attached analysis, in-rack placement has the lowest data loss probability. This is counter-intuitive: in reality, the chance of losing a rack is not small. That is why a block placement policy normally places a block in at least two racks.

          rschmidt Rodrigo Schmidt added a comment -

          The assumption that failures are independent was made to ease the calculations. Assuming that failures are dependent would add too much complexity to the problem.

           In the implementation, we are still trying to keep a block's replicas on separate racks, and that's why I said there will be two windows: one for the remote rack to be chosen, and the other to limit the machines that can be chosen inside that rack (for each block that must be replicated).

          jsensarma Joydeep Sen Sarma added a comment -

           But map-reduce programs rarely read from just one file. As long as the total input data spans most racks in a cluster, that should be good enough.

           Data population in most clusters (as in our case) itself happens via parallel processes (whether crawlers or distcp-like parallel copiers). These outputs are in turn consumed by other map-reduce jobs, and so on. Based on this, I would imagine that most input data sets consist of a number of files populated by different nodes/writers (and hence likely to span, in totality, a large number of racks in the cluster).

          hairong Hairong Kuang added a comment -

          More questions...

           I understand that the proposed policy tries to reduce the data loss of a file by placing its blocks on N datanodes, where N is less than or equal to the total number of datanodes in the HDFS cluster. Why does the policy need to limit the number of racks to choose from? It seems to me that the more racks, the better.

          How do you plan to handle the balancer?

          rschmidt Rodrigo Schmidt added a comment -

           The higher the number of racks, the higher the latency of the write pipeline. Otherwise, the default block placement policy would always use 3 different racks instead of always placing 2 replicas on the same rack.

          I've had some general discussions with Dhruba about the balancer. It doesn't use the block placement policy, so I'll have to extend it separately.

          hairong Hairong Kuang added a comment -

           I did not mean the number of racks per block; I meant the number of racks per file.

          rschmidt Rodrigo Schmidt added a comment -

          As Joydeep wrote, we didn't think this was a major problem. What is your proposal to fix that?

          hairong Hairong Kuang added a comment -

           For a large file it does matter, especially in the use case of compacting a large number of small files (like reduce outputs) into one by concatenating or archiving.

           Anyway, whether it matters or not, my question is why you want to have this rack limitation.

          rschmidt Rodrigo Schmidt added a comment -

           I don't think I understand your use case. It only seems to be advantageous to do what you say if there are multiple readers of the same file.

          We designed it this way because it would be relatively easy to understand, implement, generalize, and plan for (as users).

          But I'm quite open to options. What would you propose instead?

          shv Konstantin Shvachko added a comment -

          The math looks good to me (in pdf file).

          > Data loss probability P depends on time T.

           Here the assumption is, correct me if it's wrong, that f nodes fail simultaneously. Otherwise, we should take into account the replication process, which will be restoring some blocks while the other nodes are still up, decreasing the probability of data loss. The probability of losing f nodes simultaneously at a particular moment does not depend on time. The probability of a simultaneous failure of f nodes during a specific period of time depends on the length of the period. So if you choose the parameter p in the document correctly (depending on the time period), then you get the probability of a data loss during this period of time.

          The assumption p = 0.01 or 0.001 seems arbitrary, but it probably does not matter as you compare different strategies with the same value.

           What is missing in the analysis is that the probability of losing a whole rack is much higher than the probability of losing any 20 machines in the cluster. It should actually be equivalent to the probability of losing one machine, because you lose one switch and the whole rack is out.
          And that was one of the main reasons why we decided to replicate off rack.
          Rodrigo, did I understand correctly that your idea is to experiment with replication within the rack, that is, all replicas are placed on different machines in the same rack?

          kannanm Kannan Muthukkaruppan added a comment -

           Konstantin: Re: <<< because you lose one switch and the whole rack is out >>> The calculations were primarily to reason about data loss (multiple disk failures that could potentially lead to all replicas of a block getting lost) and strategies to deal with that, rather than temporary data unavailability.

          rschmidt Rodrigo Schmidt added a comment -

          @Konstantin: The new policy will have, for every block, a limited window of racks you can choose from, and a limited window of machines within such racks. For every block, we will keep the idea of having a copy that is local to the writer, and two copies at a remote rack, but always respecting this limited window of choices.

          hairong Hairong Kuang added a comment -

           > I don't think I understand your use case. It only seems to be advantageous to do what you say if there are multiple readers of the same file. We designed it this way because it would be relatively easy to understand, implement, generalize, and plan for (as users).

           You still do not get my point. I think the goal of this policy is to reduce data loss by limiting the # of nodes a file's data is placed on. Is this additional limitation of racks necessary? Or are you saying this is just easier for you to implement? I do not see how this helps users understand or plan. In general, having fewer configuration parameters is easier for users to understand and plan for.

          rschmidt Rodrigo Schmidt added a comment -

          I think I get your point. I don't think you get mine, though. I just want to know what else you have in mind.

          Let me put it this way: If you were to implement this, what would you do?

          rschmidt Rodrigo Schmidt added a comment -

           Just to complement my previous comment: I don't know of a better way to implement this that wouldn't be either overly complicated or hard to configure and understand. I'm open to other ideas, but you have to give me some.

          hairong Hairong Kuang added a comment -

          Hi Rodrigo, I do not know your algorithm so I have no idea how relaxing the rack restriction would complicate your implementation.

           Assume that a user wants to place a file's blocks on at most N datanodes and a cluster has R racks. If you place blocks on at most N/R datanodes per rack, is that a special case of your proposal? Of course, there are other algorithms too...

          rschmidt Rodrigo Schmidt added a comment -

          Hairong, would you mind explaining how this approach would reduce the probability of data loss? I don't quite see it.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          > Here the assumption is, correct me if it's wrong, that f nodes fail simultaneously. Otherwise, we should
          > take into account replication process, ...

          I totally agree that we should take the replication process into account. That's also the reason we cannot ignore time T in the equations. A model that only considers data loss (i.e., disk failure) probability P without time T is, in my opinion, not useful, because the probability of simultaneous multiple disk failures in the real world is really low, probably negligible. I think it would happen only in a site-wide disaster situation.

          Rodrigo, have you observed any case of simultaneous multiple disk failure?

          jsensarma Joydeep Sen Sarma added a comment -

          @Rodrigo - how about using a term to describe the groups of nodes that we are dividing the cluster into? (I will call it a 'node-group' for now).

          @Hairong - if we drop the rack restriction - then it seems that each node-group will have to have at least 2 x (# of racks) nodes. This is because we want 2 nodes per rack in each node-group (for 2 rack-local copies for the writer). So we are losing control of the size of the node-group. I think the N racks x M nodes definition is much more flexible - it gives control of how much rack fault tolerance we want (by controlling the parameter N) vs. how much fault isolation/re-replication bandwidth we want (by controlling the overall product NxM).

          One other thing - i think contiguous racks have a higher chance of correlated failures (because of shared physical location/temperature etc., as well as the fact that in a heterogeneous cluster - new hardware may be added in contiguous racks). So the set of N racks is better chosen to be alternating (or something of that sort).

          One problem not discussed here (I think) is what happens when more racks are added (and the node-group definitions suddenly change - this is particularly problematic if some modulo arithmetic is used to choose racks for a node-group). Does the system re-replicate things to conform to new node-group definitions? (That would be awful). If not - how does it know not to?

          jsensarma Joydeep Sen Sarma added a comment -

          regarding T and simultaneous failures:

          • the smaller the node-group - the larger the recovery time (since the time to re-replicate data from a bad disk increases when only a smaller set of nodes have copies of the bad data). so there's a tension here between fault isolation and recovery time. However - it would seem that there are diminishing returns to re-replication bandwidth as the size of a node-group is increased.

          it would be useful to know what this curve looks like (time to recovery versus size of node-group) to choose a good cutoff - a rough numerical model is sketched at the end of this comment. this may also force one to choose a larger number of racks (N) for a node group. I would be much more worried about re-replication bandwidth than IO bandwidth for map-reduce.

          • i think we have seen a large number of failures due to bad hardware (a faulty set of nodes) or operator error (some set of nodes re-imaged). (which is why i think the node group should not be contiguous in definition)

          there are also events (cluster-wide restart/reboot causing several nodes to not boot because of a bad boot disk/sector) that cause simultaneous failures.

          as the number of nodes has increased - what i have heard anecdotally is that at any given point in time - a significant number of nodes are down and sometimes this causes blocks to go missing. so in that sense - i guess yeah - we do see simultaneous failures.

          there are other strategies to decrease T (and thereby decrease the overall failure probability). If re-replication bandwidth is the constraining factor - a separate network (from map-reduce traffic) for re-replication/balancing may be the right solution. however i don't know if that is the case (in the case of these simultaneous failures).
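
          A rough way to see this tension numerically - a minimal Python sketch, where the 4 TB per node and 100 MB/s of usable per-node re-replication bandwidth are illustrative assumptions, not measured values:

          def recovery_hours(group_size, data_per_node_tb=4.0, node_bw_mb_s=100.0):
              # Assumes the failed node's data is re-replicated in parallel by the
              # surviving members of its node-group, each limited by per-node bandwidth.
              total_mb = data_per_node_tb * 1e6
              writers = group_size - 1
              return total_mb / (writers * node_bw_mb_s) / 3600.0

          for g in (4, 8, 16, 40, 400):
              print(f"group size {g:3d}: ~{recovery_hours(g):.1f} hours to recover one node")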

          scott_carey Scott Carey added a comment -

          I've had some general discussions with Dhruba about the balancer. It doesn't use the block placement policy, so I'll have to extend it separately.

          To me, the biggest causes of data risk seem to be the parts of the system that don't adhere to a good placement policy, not the default write placement policy. Furthermore, this plan will decrease the probability of individual data loss but increase the amount of data lost when it does happen. Lastly, most clusters these days have 3 to 8 hard drives per node. This policy is blind to that, and it is a big factor in the sort of data loss we are concerned with here (a disk dead for good, not a temporary unavailability). It would seem that if you are trying to force a file to only exist on 3 nodes, you would want to go all the way down to 3 disks on 3 different machines. In fact, a policy that achieves some of the intended effect leaves the node placement exactly as it is now but requests that a file being written goes to a particular disk in a node (a large random number mod the number of disks in the node). If there were 4 disks per node, this would decrease the likelihood of individual file data loss by restricting the placement for any given file to 1/4 of the disks, but the performance would be OK because the file is still distributed uniformly by node.

          Components that make block placement choices:
          Balancer: The balancer must be rack-aware and requires a separate placement policy tradeoff (3 replicas on 3 different racks is a good thing there).

          SetRep: Replication changes to files, and especially lowering the replication factor, cause lots of placement inefficiencies. For example, if you have a large file, or a collection of files whose block count is much larger than the number of nodes, and you decrease the replication (say from 5 to 4), then the end result is that the nodes that had the highest disk usage at the time will now have ZERO blocks from those files.

          Node death / decommission; recommission and new nodes: When a node temporarily disappears and its missing blocks are replicated elsewhere, placement choices are made. When it comes back and there are excess replicas, deletion choices are made. Both of these choices need to be aware not just of the best placement for data-loss reasons, but also of the best placement for performance reasons.

          Ideally, a given file, directory, or any randomly chosen subset of blocks in the system will be distributed randomly over the cluster.
          What actually happens in a growing cluster with new machines being provisioned, replication factors changing in some areas, and nodes or racks disappearing and coming back, is that the actual block placement depends on the history of the cluster. Files written long ago may only be on subsets of machines. Blocks involved in a prior temporary node failure may be inefficiently placed. Lots of blocks may have all their replicas in one rack due to this.

          I think if the goal is to reduce the chance of data loss, looking holistically at all the components in the system that make such decisions is going to yield better results than looking only at the file write process. Initial block placement can have some variations, but the biggest challenge is in maintaining good block placement IMO.

          rschmidt Rodrigo Schmidt added a comment -

          @Joydeep: Using a term to describe the groups is a great idea. Just bear in mind that we are not dividing the cluster into exclusive groups. Each block has a limited set of nodes on which it can be replicated, but separate blocks might have different sets of nodes associated with them, and the intersection between these sets might not be empty.

          Here is a sketch of the algorithm to make things a little more clear. This is just a simplification of the algorithm, and I am not describing several corner cases and reconfiguration scenarios.

          Configuration parameters:
             - R: rack window, distance from initial rack we are allowed to place replicas
             - M: machine window, size of the machine window within a rack
          
          Whenever network topology changes:
             - Sort racks into a logical ring, based on rack name
             - Sort nodes within each rack into logical rings, based on node names
          
          For the first replica:
             - Write to the local machine, if possible, or pick a random one
          
          For the second replica:
             - Let r be the rack in which the first replica was placed
             - Let i be the index of the machine in r that keeps the first replica
             - Pick random rack r2 that is within R racks from r
             - Pick random machine m2 in r2 that is within window [i, (i+M-1)%racksize]
             - Place replica in m2
          
          For the third replica:
             - Given steps above, pick another random machine m3 in r2 that is within the same window used for m2
             - Make sure m2 != m3
             - Place replica in m3
          
          

          I hope this explanation helps solve the confusion.
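
          A minimal Python rendering of the steps above (the toy topology, the exclusion of the writer's own rack from the rack window, and the handling of windows larger than a rack are all illustrative assumptions; the corner cases mentioned above are ignored):

          import random

          # Toy topology: rack name -> nodes sorted into a logical ring.
          topology = {
              "rack0": ["n00", "n01", "n02", "n03"],
              "rack1": ["n10", "n11", "n12", "n13"],
              "rack2": ["n20", "n21", "n22", "n23"],
              "rack3": ["n30", "n31", "n32", "n33"],
          }
          R = 1  # rack window: allowed distance from the writer's rack
          M = 2  # machine window: window size inside the chosen remote rack

          def place_block(local_rack, local_node):
              racks = sorted(topology)                    # logical ring of racks
              r = racks.index(local_rack)
              i = topology[local_rack].index(local_node)  # index of the first replica

              # Second and third replicas go to one random rack within +/- R of the
              # writer's rack, inside the machine window starting at index i.
              offset = random.choice([d for d in range(-R, R + 1) if d != 0])
              r2 = racks[(r + offset) % len(racks)]
              nodes2 = topology[r2]
              window = [nodes2[(i + k) % len(nodes2)] for k in range(M)]
              m2, m3 = random.sample(window, 2)           # two distinct machines
              return [local_node, m2, m3]

          print(place_block("rack1", "n12"))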

          jsensarma Joydeep Sen Sarma added a comment -

          one can't have a different node-group for each block/file. that would defeat the whole point. (in fact - every block today is in a 3-node node-group - and there are gazillions of such node groups that overlap).

          the reduction in data loss probability comes out of the fact that the odds of 3 nodes falling into the same node-group is small. (if they don't fall into the same node-group - there's no data loss).

          if the number of node groups is very large (because of overlaps) - then the probability of 3 failing nodes falling into the same node group will start going up (just because there are more node-groups to choose from). the more the node-groups are exclusive - the better. that means the number of node-groups is minimized wrt. a constant number of nodes. as i mentioned - the size of the node-group is dictated to some extent by re-replication bandwidth. one wants very small node groups - but that doesn't work because there's not enough re-replication bandwidth (a familiar problem in RAID).

          if u take some standard cluster (say 8 racks x 40 nodes) - how many distinct node groups would ur algorithm end up with?

          rschmidt Rodrigo Schmidt added a comment -

          I'm not sure I understand what you mean by node group. Sometimes it seems to be the set of machines on which replicas of the same block can be placed (always triples), and sometimes it seems to be the bigger set from which you pick those replicas. Which one is the right one? I guess a more formal definition would clear up the ambiguity.

          shv Konstantin Shvachko added a comment -

          Rodrigo, I understand now that you place replicas on a limited group of racks. Sounds like Ceph's placement groups.
          But the document on probabilities does not address this case, does it? How do you know the probability of data loss is going to be less? How much?

          jsensarma Joydeep Sen Sarma added a comment -

          my mental model is that a node group is a fixed (at any given time) of nodes. together the node groups cover the entire cluster. it would be nice to have the node-groups to be exclusive - but it's not necessary. the reduction in data loss probability would largely come from the small size of each node group (combinatorically - the number of ways a small set of failures can all happen within one node group is much smaller compared to the total number of ways those same number of failures can happen).

          the assumption is that the writer would choose one node group (for a block it's writing). if node groups are exclusive and we want to have one copy locally - this means that the node group is fixed per node (all writers on that node would choose the node-group to which the node belongs). that is why it seems that an entire file would always belong to the same node-group (as Hairong has argued).

          node groups cannot be too small - because (as i mentioned) - we would limit re-replication bandwidth (and therefore start increasing time to recovery for single bad disk/node). we really need a plot of time to recovery vs. node-group size to come up with a safe size for the node group. having exclusive groups is nice because it minimizes the number of node groups for a given node group size.

          jsensarma Joydeep Sen Sarma added a comment -

          > - Pick random rack r2 that is within R racks from r
          > - Pick random machine m2 in r2 that is within window [i, (i+M-1)%racksize]

          a few points:

          • dangerous to choose physically contiguous racks for node groups (because of correlated failures in consecutive racks). may make things a lot worse.
          • if rack numbering is based on some arithmetic (so that logically contiguous is not physically contiguous) - then one has to reason about what happens when new rack is added (i think it's ok to leave existing replicated data as is - but it's worth talking about this case. what would the rebalancer do in this case?)
          • easy to reduce the overlap between node-groups (and thereby decrease loss probability) - a small counting sketch follows this comment:
          • instead of [i, (i+M-1)%racksize] - choose [ (i / (r/M))*r/M, (i / (r/M))*r/M + M-1] // fixed offset groups of M nodes each in a rack

          i glanced through the Ceph algorithm (http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf) - it doesn't try to do what's described here. the number of placement groups is not controlled (and that is not an objective of the algo.).

          a close analog of the problem here is seen in RAID arrays. When choosing to do parity + mirroring to tolerate multiple disk failures - one can choose to mirror RAID-parity groups or apply parity over RAID mirrored groups (1+5 vs. 5+1). turns out 5+1 is a lot better from a data loss probability perspective. the reasoning and math are similar (both are susceptible to data loss on 4-disk failures - but in 5+1 - the 4-disk failures have to be contained within 2 2-disk mirrored pairs. this is combinatorially much harder than the 1+5 case - where the 4 disk failures have to cause 2 failures each in the N/2 node groups).
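
          To make the overlap point concrete - a small sketch counting how many distinct in-rack windows each scheme produces (racksize and M are made-up numbers, and the fixed-offset variant is written as floor(i/M)*M, which is one reading of "fixed offset groups of M nodes each"):

          racksize, M = 40, 4   # illustrative rack size and machine-window size

          # Sliding windows: one window per starting node, heavily overlapping.
          sliding = {tuple((i + k) % racksize for k in range(M)) for i in range(racksize)}
          # Fixed-offset groups: disjoint groups of M nodes each.
          fixed = {tuple((i // M) * M + k for k in range(M)) for i in range(racksize)}

          print(len(sliding), "distinct sliding windows")    # racksize = 40
          print(len(fixed), "distinct fixed-offset groups")  # racksize / M = 10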

          rschmidt Rodrigo Schmidt added a comment -

          @Konstantin: We have a script that calculates the probabilities. The author is on vacation, so I can't just upload it here without his consent (will do that as soon as he agrees). I will post a table with values as soon as I get some time. Right now I'm trying to finish up the first version of the code and test it.

          @Joydeep:

          > a node group is a fixed (at any given time) of nodes
          ??? I think you missed the most important part of the sentence

          I tried to get the meaning from the rest of your text but some things still look weird. You seem to convey that node groups might be bigger than the number of replicas you want to create. If so, how do you choose the replicas within the node group? In the algorithm I'm implementing it's not the case that any choice of 3 nodes is valid within what I think you are calling a node group.

          > instead of [i, (i+M-1)%racksize] - choose [ (i / (r/M))*r/M, (i / (r/M))*r/M + M-1] // fixed offset groups of M nodes each in a rack

          And how do you cope with racks of different sizes in this case?

          jsensarma Joydeep Sen Sarma added a comment -

          - of course i meant a node group = fixed set of nodes.

          and of course - replication is done within any three nodes of a node group (and the node group is much larger than 3, of course). i understand there's a subtext to this - that we do want the replicas to span racks and that the writer writes one locally and one within the same rack. that's all good and consistent though. as long as the node group spans multiple racks and there are two nodes from the writer's rack - we are good. i don't see how this is all that different from the protocol u listed - where u are choosing nodes randomly subject to some constraints.

          the main difference is that i am suggesting minimizing the number of node groups. ur protocol implicitly has a finite number of node groups - but i am trying to argue that this number is much, much larger than it needs to be. the math is pretty obvious (though tedious). the reduction in loss probabilities comes from the small node group. but it is offset by the larger number of node groups. if one doesn't minimize the number of node groups - we start losing out on the benefits of this scheme.

          good point about racks with different number of nodes. let me think about it.

          rschmidt Rodrigo Schmidt added a comment -

          I might be wrong, but it seems to me that you are oversimplifying things and the notion of node groups as you present it is not sound.

          However, I can only confirm that with a clear definition of what a 'node group' is. When you say 'fixed set of nodes', for me it could be all the nodes, it could be 2-node sets, it could be a partial set, it could be pretty much any subset. What characterizes valid node groups? How many of them are there? What is their size? I can't discuss in terms of something that is not well defined.

          What I do understand is the restriction you made in the selection of nodes for the algorithm I described. Kannan, Karthik, Aravind, Dhruba, I, and others discussed that several times when we first started wondering about data loss probabilities in HDFS. From what I can remember, the main argument against it is that it doesn't generalize gracefully when racks can have different sizes, and it was not clear how much better it was compared to the current approach, or whether the current approach could simply be configured to have the same data loss probability by reducing the size of the windows (note that the current approach is configurable, and we can pretty much decide what data loss probability we will have based on the cluster configuration and the window sizes). If the others are following this discussion, they might want to chime in and say a few more words about what else made us give up on this idea before.

          jsensarma Joydeep Sen Sarma added a comment -

          (this is the closest i will get to writing a book. dedicated to u Rodrigo)

          what else is there to say? i have said it all:

          • a node group is a fixed (at a given point in time) set of nodes
          • all the node groups together span the entire cluster
          • a given block is replicated only within member nodes of a single node group

          this is the definition. there are no 'invalid' node groups. but node-groups have certain properties that we know to be highly desirable:

          • node groups should ideally, but not necessarily, be disjoint. the probability of data loss will be positively correlated with the number of node groups and inversely correlated with the size of the node group. overlapping node groups cause an excessive number of node groups for the same node-group size and hence are not desirable.
          • should all node groups be of equal size?: ideally yes. this will reduce the loss probability to a minimum (holding other factors constant)
          • all node groups should span multiple racks: this is because we know that racks are a unit of failure. so to accommodate a replication policy that spans racks - we need the node group to span racks. how many racks it should span is ideally configurable.
          • similarly, we know that hdfs wants the writer to make one node local and one rack-local copy. that pretty much means that each node group has at least 2 nodes for each rack it spans. if it's chosen as
          • because replication does not cross a node-group boundary - recovery from disk/node failure is constrained by the re-replication bandwidth available inside a node group, and that bandwidth grows with the size of the node-group. if the node-group is too small - then the time to recovery (from a bad disk/node) goes up. so one cannot have an extremely small node group (like a 2-node group).

          i cannot suggest the optimal size of a node-group without more data on this front. if one increases MTTR - then the odds of data loss go up. others have said the same thing and this is data storage 101. so unless someone plots the time to recovery for a single node failure (for a reasonably sized node) vs. the size of a node-group - it's hard to say what's large enough.

          if re-replication bandwidth were infinite - then one could have really small node-groups. this is what was done in the old days - all disks were mirrored in pairs, and disks were small enough that a mirrored disk pair had a very small MTTR. of course, disks have grown roughly 1000-fold in size since then and their bandwidth hasn't increased much, so this strategy doesn't work anymore.

          • it would be foolhardy to define node groups in a way that puts correlated units of failure in the same node-group. this is pretty obvious. which is why choosing contiguous racks for a node group may not be a good approach at all (given common rack-based rollout strategies for new hardware, as well as likely correlation in temperature)

          i would love to learn about how i am oversimplifying things. it's really simple combinatorics for me:

          • when considering simultaneous failures - the data loss probability will be dominated by what happens when the number of failed nodes equals the degree of replication (say 3).
          • if one considers the 3 node failure scenario - to cause data loss - one must place all the 3 failed nodes in the same node-group. the number of ways one can do this is:
            a. directly proportional to the number of node groups (obviously)
            b. very strongly inversely correlated to the size of the node group

          as a simple example - if i were to divide 100 nodes into

          • 10 groups of 10 each => 10 * 10C3 = 1200
          • 5 groups of 20 each => 5 * 20C3 = 5700

          as u can see - a 2-fold reduction in group size leads to nearly a 5-fold reduction in the number of ways of hitting failure (and hence in the odds of failure).

          • even if u consider higher numbers of failures - similar combinatorics still apply.

          i hope the arguments are crystal clear now - all else being equal - it's a bad idea to have overlapping node-groups. we will be holding factor (b) constant - but raising factor (a).
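
          The counts above are easy to check (a sketch assuming disjoint groups and that any 3 nodes of a group can co-host a block, i.e. ignoring the rack constraint discussed elsewhere in this thread):

          from math import comb

          nodes, failures = 100, 3
          for group_size in (10, 20):
              groups = nodes // group_size
              ways = groups * comb(group_size, failures)   # 3-node failures inside one group
              print(f"{groups:2d} groups of {group_size:2d}: {ways:5d} loss-causing triples "
                    f"out of {comb(nodes, failures)} total")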

          beyond this - as i said - i don't have a precise layout strategy right now. the most naive approach is what u said u may have already discussed. we don't have heterogeneous clusters. even if we had - we may have (at worst) a mix of 1U and 2U nodes - one could just consider a 2U node to be = 2 x 1U. so in the naive strategy, i would:

          • treat the nodes and racks as a 2-D grid.
          • random hash the rackids into a different id space for purposes of labelling on this 2-D grid (logically contiguous is not physically contiguous). one could do that with nodes as well.
          • divide the grid into NxM sized disjoint node-groups (a rectangular sub-grid), where N = number of racks and M = nodes per rack in the group. Admin decides:
              • NxM based on re-replication bandwidth
              • N based on desired rack fault tolerance
              • N >= 2 (3 at least i imagine), M >= 2
          • on rack addition - the new rack will likely be inserted in the middle of the existing rackid space. in the naive strategy:
              • only newly written blocks would be written as per the revised node-group definitions
              • the system would have more than the minimum possible number of node-groups (it would have the new groups as well as the old groups). but that's ok.

          i think smarter strategies would differ primarily based on how elegantly they handle rack insertions. the other way to handle this stuff is to have node groups not be hard-wired based on topology - but just have software generate node groups and their members. it's much easier to adjust for reconfigurations - one can easily come up with protocols to remap node-groups to make full use of a new rack - but minimize changes to existing groups. this sort of control data is rarely mutated and no one dies if everyone's not in sync. so it's an easy thing to share in a cluster.
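
          A toy rendering of this naive strategy (the hash choice, cluster dimensions, and N/M values are illustrative assumptions, not a proposed implementation):

          import hashlib

          racks, nodes_per_rack = 8, 40   # toy cluster
          N, M = 2, 5                     # racks per group, nodes per rack per group

          def rack_key(rack_id):
              # Hash rack ids so that group members are not physically contiguous racks.
              return hashlib.md5(f"rack{rack_id}".encode()).hexdigest()

          order = sorted(range(racks), key=rack_key)   # relabelled rack ring
          node_group = {}
          for pos, rack in enumerate(order):
              for node in range(nodes_per_rack):
                  # Cut the (rack, node) grid into disjoint N x M tiles.
                  node_group[(rack, node)] = (pos // N, node // M)

          print(f"{len(set(node_group.values()))} disjoint node-groups of {N * M} nodes each")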

          rschmidt Rodrigo Schmidt added a comment -

          According to your definition, what prevents me from saying that there is only one node group, which is the set of all nodes in the cluster? Or why not to say that all triples are node groups? Or any combination of k>3 nodes?

          What makes me think this definition is not sound is the fact that it doesn't link node groups to replication policies. For the rationale to be sound, there should be something saying what node groups a given policy generates.

          Your examples seem to suggest maximal sets from which any triple is a valid selection for replication. If that's the case, you are oversimplifying things. The restriction that one replica is on one rack and two replicas are on another prevents several triples in both approaches, and this has a non-negligible impact on the math.

          Your argument about disjoint and non-disjoint groups is valid only if you consider groups of equal size. A strategy with big non-disjoint groups can be as good as or better than one with disjoint groups if you reduce the size of the groups it defines.

          jsensarma Joydeep Sen Sarma added a comment -

          @Rodrigo - it's evident u have not read my post beyond first three lines. please read through in detail. if ur questions are not answered (ur comments are all categorically wrong. i have said none of the things u say i have said - and quite to the contrary) - please come and talk.

          rschmidt Rodrigo Schmidt added a comment -

          I had read it before. I read it again now and I still have the same questions. I guess I'm probably missing something.

          Anyway, the whole discussion about the node group definition might not be essential. We can probably take it offline.

          I understand your proposal about having disjoint sets of nodes, and I think it is a valid one. We decided to do it differently because the current algorithm seemed to generalize better and could get to better probabilities by reducing window sizes.

          scott_carey Scott Carey added a comment -

          The "node group" versus Rodrigo's definition:

          Node groups – yes, overlap is bad. But, no overlap is also bad. Consider:
          If there is no overlap, and the policy is to write a 3-replica block to two racks, one with two replicas and another with one, then the minimal group size is four – you must have two racks and you must have at least two nodes per rack. A larger group must always have two nodes per rack. With N racks in the node group, a rack failure causes 1/N of the blocks to be reduced to 1 replica.

          It is possible to define overlapping groups with a better worst-case scenario here, with the tradeoff being increased risk due to overlap.

          However, this whole enterprise needs to be built on a better foundation.
          some problems-

          Disks versus nodes:

          Start considering disk failure, not just node failure. Newer Hadoop versions don't fail the whole node when one disk in the node fails. There are several strategies, based on disk placement rather than node placement, that reduce data loss without impacting time to replicate. This is because the dominant factor in time to replicate is network bandwidth, which can only be improved for replication if node groups or the placement policy span more racks and nodes.
          What we should really be talking about here is "disk groups" and "disk placement policy" not "node groups" and "node placement policy".
          Disk is to Node as Node is to Rack.
          Limiting the combinatorics at the rack level is the most dangerous and increases time to recovery the most.
          Limiting the combinatorics at the node level is moderately dangerous because it increases time to recovery somewhat.
          Limiting it at the disk level is a 'free' win, since it does not increase the time to recovery much at all – that is limited mostly by network bandwidth.

          Consider this – in a 400 node cluster each with 4 disks, there are C(400,3) ways for node failure to occur.
          But, there are C(1600,3) ways for disk failure to occur. If you restrict the way replicas are placed not by node or rack, but simply by disk (say, take the block # mod the # of disks in a node to choose which disk), then this reduces the number of disk failure combinations that can cause data loss from C(1600,3) to 4*C(400,3). This is a factor of 16 improvement (the # of disks per node, squared) without any loss in TTR as long as disk speed >= network speed.
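
          The factor-of-16 figure checks out numerically (a quick sketch for the 400-node, 4-disks-per-node example):

          from math import comb

          nodes, disks_per_node = 400, 4
          unrestricted = comb(nodes * disks_per_node, 3)   # any 3 disks can co-host a block
          restricted = disks_per_node * comb(nodes, 3)     # replicas pinned to one disk index
          print("unrestricted triples:", unrestricted)
          print("restricted triples:  ", restricted)
          print(f"improvement: ~{unrestricted / restricted:.1f}x")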

          Time to recovery. Since the odds of 3 simultaneous failures depend highly on TTR, this is a huge factor.

          Here is a simple example of how time to recovery is impacted. The time to recover is roughly equal to:
          The maximum amount of data that a single 'survivor' node has that must be replicated. If we simplify this to situations where replication has dropped to 1, this is easy to analyze. You must get the data off that worst-case node.
          There is a similar analysis that has to be done for a rack as well – 'sole survivor' blocks in a rack have to be moved off rack or data loss will be a problem.

          For a node, the limit is the network, usually 1Gbit/sec or ~100MB/sec. To move 1TB of data, that takes ~10,000 seconds, or 2 hours and 47 minutes! Brand new Hadoop hardware tends to have 8TB on a node these days (though network bonding to 2Gbps is more common).
          Bandwidth between racks tends to be between 1/16 and 1/4 of the aggregate bandwidth within a rack. So worst case time to replicate tends to be longer than for a node if the goal is to make sure all lost data exists on at least two racks.

          So how does that affect things?
          Let's assume a node has a 1 in 10,000 chance of failing in a given hour (there are 8,760 hours in a year). Let's assume only 3 nodes for now.
          If a node fails, then the odds of the other two also failing within a 1-hour recovery window are (1/10000)^2, or one in 100,000,000.
          If the time to recover is 3 hours instead of 1, the odds are (3/10000)^2, or one in ~11,111,111. Increasing the time to recover by a factor of 3 increases the chance of data loss by a factor of 9. If this were a 4-replica situation, the relationship would be even stronger.
          The relationship between TTR and chance of data loss is proportional to
          (TTR)^(R-1)
          where TTR is the time to recover and R is the number of replicas.
          For node failure, we consider the number of replicas = 3. For rack failure, it is effectively 2 (blocks are only on 2 racks). So node failure is more sensitive to TTR than rack failure, but the latter is obviously more dangerous.
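
          A minimal sketch of that scaling, assuming independent failures with a fixed per-hour probability (the helper name and numbers are only illustrative):

          def loss_odds(p_per_hour, ttr_hours, replicas):
              # After the first failure, the remaining (replicas - 1) copies must all
              # fail within the recovery window for the data to be lost.
              return (p_per_hour * ttr_hours) ** (replicas - 1)

          loss_odds(1e-4, 1, 3)   # 1e-8, one in 100,000,000
          loss_odds(1e-4, 3, 3)   # 9e-8, 9x worse for a 3x longer TTR
          loss_odds(1e-4, 3, 4)   # 27x worse than the 1-hour TTR case at replication 4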

          "node groups" or "restricted placement" and TTR

          If the replication algorithm prioritizes the lowest replication count blocks first, then the TTR is proportional to the size of the 'sole survivor' data on the node with the most such data.

          Assume M racks, N nodes per rack, 4TB of data per node, 1Gbps node bandwidth, 10Gbps rack bandwidth.
          Analysis – current (random) policy: write local, then choose a random rack and node for the second replica, and for the third replica choose a random node in the same rack as the second.

          • Rack Failure: When a rack fails, it leaves behind many 'sole survivor' blocks. Roughly 2/3 of the data that must be re-created has to cross racks (blocks that lost their pair need a new off-rack copy, and blocks that lost their lone replica need their new copy placed off the surviving pair's rack) – about 2/3 * (N * 4TB) in total. The other 1/3 can be copied in-rack, which is not the limiting factor here.
            These surviving replicas are evenly spread over the remaining M-1 racks, and copying starts immediately – replicating blocks to another rack and then to another node in that rack. The rack bandwidth is the limiting factor:
            2/3 * (N * 4TB) / (M-1) is transferred off of each rack to another. At 10Gbps rack bandwidth this is ~1000MB/sec. If we assume 40 servers in a rack and 10 racks, this is ~11.8TB at 1000MB/sec, or ~11,800 seconds ~= 3.3 hours.
          • Node Failure: A single node, provided a perfect placement policy by the NN, balancer, and writers, should not hold two replicas of one block. Its TTR to get the replica count back to the pre-failure state is as follows:
            All 4TB of its blocks have to be re-created, but the surviving replicas are spread over many racks. 1/3 of the node's data was written by it, and those replicas are all off-rack in pairs; only one node of each pair has to generate a copy, but to maintain good block placement that replication has to go off-rack. So replication goes between racks for 1/3 of the data. The other 2/3 of the data is 'paired' data, where one other replica is already in-rack and the third is in another rack; this data can be replicated within the rack. If scheduled optimally, 2/3 of the data is replicated in-rack evenly, and 1/3 is replicated cross-rack evenly.
            In-rack replication is limited by node bandwidth, and the 2/3 of the data that needs this sort of replication is spread across the remaining nodes – not perfectly evenly (there is more in the rack that lost the node), but it can be anywhere on the cluster. So this is 2/3 * 4TB / (N*M - 1) ~= 6.7GB per node @ 1Gbps = 67 seconds.
            The 1/3 of the data that must flow between racks takes longer – it is evenly distributed across nodes on other racks, so each rack must transfer 1/3 * 4TB / M off it, in this example 133GB at 10Gbps = 133 seconds. (A small calculator reproducing these numbers follows below.)
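
          A tiny calculator for the arithmetic above (assumes the example figures: M=10 racks, N=40 nodes per rack, 4TB per node, ~100MB/s per node, ~1000MB/s between racks; all names are mine):

          M, N = 10, 40                      # racks, nodes per rack
          DATA_MB = 4 * 1_000_000            # 4TB per node, in MB
          NODE_BW, RACK_BW = 100.0, 1000.0   # MB/s

          # Rack failure: 2/3 of the rack's data must cross racks, spread over the other M-1 racks.
          rack_ttr_hours = (2/3) * N * DATA_MB / (M - 1) / RACK_BW / 3600    # ~3.3 hours

          # Node failure: 2/3 of the node's data is copied in-rack, 1/3 cross-rack.
          in_rack_seconds    = (2/3) * DATA_MB / (N * M - 1) / NODE_BW       # ~67 s per node
          cross_rack_seconds = (1/3) * DATA_MB / M / RACK_BW                 # ~133 s per rack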

          Now, what about a policy that creates the minimum 4-node 'node group' – two nodes in one rack, two in another? If a rack dies, the calculation is similar to the above. Assuming groups can pair with random other racks with no limitations, the "survivor" half of each group must choose a new partner node pair and replicate to it, off-rack. 2/3 of the blocks will have only one surviving replica, 1/3 will have two. None of the surviving data is spread over other racks, so all of it must be transferred cross-rack. This means the time to replicate is 3/2 of the policy above, about 4.9 hours. That directly translates into a 50% increased chance of data loss or availability loss from rack failure.

          When a node dies out of the 'set' of four, all of the blocks it wrote (1/3 of its data) are on the two nodes in the other rack. The other 2/3 of its data was shared with its 'peer' in the same rack, which now must replicate all of that data to a new peer, in the same rack or elsewhere. Let's give it the benefit of the doubt and assume it can choose a peer in the same rack. Recovery is limited by this replication – 2/3 of 4TB from one node to another, which takes about 7.4 hours. The nodes in the other rack must also write to this new peer to bring their blocks' replication from 2 back to 3, which stretches the time even further because the peer can only absorb writes so fast. Those blocks could be replicated to the other surviving node instead, but then the group becomes very unbalanced. The absolute minimum time is the 7.4 hours above, and the 'balanced' time is 1.5x that.
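
          Checking those two figures under the same assumptions as above (illustrative arithmetic only):

          MB_PER_TB = 1_000_000
          # Grouped policy, rack failure: all recovery traffic crosses racks, 3/2 of the default case.
          grouped_rack_ttr_hours = 1.5 * 11_800 / 3600                       # ~4.9 hours
          # Grouped policy, node failure: 2/3 of 4TB funnels through a single new peer at ~100MB/s.
          grouped_node_ttr_hours = (2/3) * 4 * MB_PER_TB / 100 / 3600        # ~7.4 hours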

          factoring in TTR

          So, the odds that a given failure of 3 nodes loses a particular block's data, with replication = 3 on a 400 node cluster with uniform block placement, are one in C(400,3), roughly one in 10.6 million.
          With 100 groups of 4 nodes, the likelihood is the same for a given block. However, over all blocks it is reduced, with the trade-off that when a loss does occur, more blocks are lost at once.
          There were C(400,3) different node combinations that could lead to data loss, and now there are only 100 * C(4,3) = 400 that can.
          This is a reduction in data loss probability of C(400,3) / (100 * C(4,3)), roughly a factor of 26,000.
          However, the time to replicate is now not 133 seconds but 26,667 to 40,000 seconds – 200 to 300 times longer. That increases the likelihood of such a 3-node failure dramatically: at best by a factor of 40,000 (200^2). And for blocks with higher replication it is even worse – 8,000,000+ (200^3).
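
          The combination counts, for reference (illustrative arithmetic; it assumes independent node failures within a recovery window):

          from math import comb

          uniform_triples = comb(400, 3)                   # 10,586,800 fatal triples with random placement
          grouped_triples = 100 * comb(4, 3)               # 400 fatal triples with 100 disjoint groups of 4
          reduction = uniform_triples / grouped_triples    # ~26,500x fewer fatal combinations
          ttr_penalty = 200 ** 2                           # loss probability scales as TTR^(R-1), R = 3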

          So, while this strategy will likely help with replication = 2 loss, and sometimes help replication = 3, it is disastrous for replication = 4 or more.
          Furthermore, it complicates block placement, removal, and emergency replication significantly.

          To have higher data safety, I feel that working on reducing TTR and preventing normal cluster activity from causing bad placement and distribution is most important.

          However, a disk-based strategy can be of use without impacting TTR. Disk failure, node failure, and rack failure are all different beasts here.

          Hide
          dhruba dhruba borthakur added a comment -

          This is a really an excellent discussion and something that is at the heart of providing good data availability guarantees of HDFS.

          In our production HDFS clusters, I have never seen "missing blocks" occurring at all, largely because HDFS re-replicates blocks as soon as one of the replicas goes missing. Each of our nodes has 12 TB, but because we have a 2000 node cluster, the MTTR is small – less than an hour or so.

          However, I have seen "missing blocks" occur many times when a large number of machines reboot simultaneously. This happens after a power-failure event, or when a bad piece of software inadvertently decides to reboot many machines at the same time. Many cluster nodes refuse to start up after a power-failure event and need manual intervention to resurrect them. Out of our 2000 node cluster, it is likely that 50-odd machines refuse to start up. These 50 machines are randomly scattered among all 2000 nodes; their rack or node placement in the network topology usually has nothing to do with their refusal to start up. It is usually the case that a few blocks go missing until a human goes and restarts most of those 50 machines.

          I have never seen simultaneous correlated disk failures on different nodes.

          So, for a cluster that has a couple of thousand nodes, it might make sense to create 20 node-groups, each of about 100 machines. For smaller clusters, the current random-allocation policy might continue to be the best. This should keep the TTR relatively small.

          I would like to hear from other system administrators on whether they frequently see "missing blocks" occurrences, and their reasoning/analysis on what could have caused those situations. It will help us better understand the best policy to adopt for block placement.

          This JIRA does not yet talk about the engineering aspects of implementing a node-group. By design, these groups have to be persisted across cluster restarts. Rodrigo's approach says that arranging the IP addresses of nodes in some priority order gives us a way to decide node groups. What are the command-line utilities to list/specify node-group sizes? Do we allow resizing of node-groups on a running cluster? Can we say that this proposal requires the existence of an "includes" file, so that the node-group assignment remains the same even if a few nodes are transiently out of service? Do we define the priority order of nodes so that it is circular in nature? If so, it could cause some trouble when new nodes are added to the cluster. Can we say that a new balancer has to be developed that detects and fixes block replicas based on the node-group policy?

          Hide
          szetszwo Tsz Wo Nicholas Sze added a comment -

          Thanks Dhruba for the clear explanation.

          It seems to me that this problem is not as simple as initially suggested. Could we have a self-contained design doc?

          Hide
          rschmidt Rodrigo Schmidt added a comment -

          I got permission from Alex Smith to post his script that calculates probabilities here. I'll do that in a minute, together with a wrapper I wrote that, given some cluster configurations, displays the data loss probabilities of the following approaches:

          • DEFAULT: The default block placement policy used by HDFS
          • RING GROUPS: The algorithm I described
          • DISJOINT GROUPS: Separating the cluster into disjoint node groups, as Joy proposed.

          Both RING and DISJOINT depend on windows of choice (node groups) that I define along two dimensions: racks and machines_within_rack. Thus, the total size of the window is racks * machines_within_rack machines.

          Below you will find some numbers. You might notice that the difference between RING and DISJOINT is not as big as one might think. Assuming the scripts are right, here is a quick explanation.

          RING doesn't have as much freedom in selecting triples inside the window as DISJOINT has. In RING, the first replica is fixed and defines the group. So, the space of choice is limited to only two replicas that must be in the same rack.

          DISJOINT allows the primary replica to be in any rack in the window, allowing more triples.

          This was part of the confusion I had understanding the 'node groups' definition and math, since that factor was not being taken into consideration.
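
          For intuition, here is a rough first-order estimate (this is not the attached script – it just counts, for each policy, the node triples that could hold all three replicas of some block and weights each by p^3; the names are mine):

          from math import comb

          def fatal_triples(racks, nodes_per_rack):
              # Triples of the form: 2 nodes in one rack + 1 node in a different rack.
              return racks * comb(nodes_per_rack, 2) * (racks - 1) * nodes_per_rack

          def default_estimate(racks, nodes_per_rack, p=0.001):
              return fatal_triples(racks, nodes_per_rack) * p**3

          def disjoint_estimate(racks, nodes_per_rack, win_racks, win_nodes, p=0.001):
              groups = (racks // win_racks) * (nodes_per_rack // win_nodes)
              return groups * fatal_triples(win_racks, win_nodes) * p**3

          # 6 racks x 20 machines: ~1.1e-4 for DEFAULT and ~1.2e-6 for a (2 racks, 5 machines)
          # DISJOINT window – the same ballpark as the script output below.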

          Here are the numbers:

          ===== 6 racks of 20 machines = 120 machines =====

          DEFAULT => 0.00010667

          RING GROUPS (window = 2 racks, 5 machines) => 2.37756e-06
          DISJOINT GROUPS (window = 2 racks, 5 machines) => 1.19392e-06

          RING GROUPS (window = 2 racks, 10 machines) => 1.04558e-05
          DISJOINT GROUPS (window = 2 racks, 10 machines) => 5.3346e-06

          RING GROUPS (window = 2 racks, 20 machines) => 3.97347e-05
          DISJOINT GROUPS (window = 2 racks, 20 machines) => 2.22069e-05

          RING GROUPS (window = 3 racks, 5 machines) => 4.77748e-06
          DISJOINT GROUPS (window = 3 racks, 5 machines) => 2.38034e-06

          RING GROUPS (window = 3 racks, 10 machines) => 2.01693e-05
          DISJOINT GROUPS (window = 3 racks, 10 machines) => 1.06015e-05

          RING GROUPS (window = 3 racks, 20 machines) => 6.95331e-05
          DISJOINT GROUPS (window = 3 racks, 20 machines) => 4.38487e-05

          RING GROUPS (window = 6 racks, 5 machines) => 1.15237e-05
          DISJOINT GROUPS (window = 6 racks, 5 machines) => 5.91359e-06

          RING GROUPS (window = 6 racks, 10 machines) => 4.57052e-05
          DISJOINT GROUPS (window = 6 racks, 10 machines) => 2.61534e-05

          RING GROUPS (window = 6 racks, 20 machines) => 0.00010667
          DISJOINT GROUPS (window = 6 racks, 20 machines) => 0.00010667

          ===== 50 racks of 20 machines = 1000 machines =====

          DEFAULT => 0.00584495

          RING GROUPS (window = 2 racks, 5 machines) => 1.99588e-05
          DISJOINT GROUPS (window = 2 racks, 5 machines) => 9.94932e-06

          RING GROUPS (window = 2 racks, 10 machines) => 8.92791e-05
          DISJOINT GROUPS (window = 2 racks, 10 machines) => 4.44541e-05

          RING GROUPS (window = 2 racks, 20 machines) => 0.000369203
          DISJOINT GROUPS (window = 2 racks, 20 machines) => 0.000185042

          RING GROUPS (window = 5 racks, 5 machines) => 7.92698e-05
          DISJOINT GROUPS (window = 5 racks, 5 machines) => 3.94854e-05

          RING GROUPS (window = 5 racks, 10 machines) => 0.000348759
          DISJOINT GROUPS (window = 5 racks, 10 machines) => 0.000174956

          RING GROUPS (window = 5 racks, 20 machines) => 0.00133532
          DISJOINT GROUPS (window = 5 racks, 20 machines) => 0.000716171

          RING GROUPS (window = 10 racks, 5 machines) => 0.000174504
          DISJOINT GROUPS (window = 10 racks, 5 machines) => 8.85916e-05

          RING GROUPS (window = 10 racks, 10 machines) => 0.000742117
          DISJOINT GROUPS (window = 10 racks, 10 machines) => 0.00038451

          RING GROUPS (window = 10 racks, 20 machines) => 0.00262672
          DISJOINT GROUPS (window = 10 racks, 20 machines) => 0.00153868

          ===== 100 racks of 20 machines = 2000 machines =====

          DEFAULT => 0.0160829

          RING GROUPS (window = 2 racks, 5 machines) => 4.04848e-05
          DISJOINT GROUPS (window = 2 racks, 5 machines) => 1.98985e-05

          RING GROUPS (window = 2 racks, 10 machines) => 0.000181715
          DISJOINT GROUPS (window = 2 racks, 10 machines) => 8.89063e-05

          RING GROUPS (window = 2 racks, 20 machines) => 0.000737402
          DISJOINT GROUPS (window = 2 racks, 20 machines) => 0.000370051

          RING GROUPS (window = 10 racks, 5 machines) => 0.000352741
          DISJOINT GROUPS (window = 10 racks, 5 machines) => 0.000177175

          RING GROUPS (window = 10 racks, 10 machines) => 0.00151145
          DISJOINT GROUPS (window = 10 racks, 10 machines) => 0.000768873

          RING GROUPS (window = 10 racks, 20 machines) => 0.00550483
          DISJOINT GROUPS (window = 10 racks, 20 machines) => 0.00307498

          RING GROUPS (window = 25 racks, 5 machines) => 0.000906714
          DISJOINT GROUPS (window = 25 racks, 5 machines) => 0.000452064

          Hide
          rschmidt Rodrigo Schmidt added a comment -

          BTW, The numbers assume that p (probability of a single node failing) is 0.001.

          Hide
          rschmidt Rodrigo Schmidt added a comment -

          Script written by Alex Smith to calculate data loss probabilities of different HDFS block placement policies.

          Hide
          rschmidt Rodrigo Schmidt added a comment -

          Script I wrote to wrap Alex's program and give the results of different proposals in different cluster settings.

          Hide
          scott_carey Scott Carey added a comment -

          This needs to change "p" from a constant, to a function of the TTR window.

          "probablility of a single node failing" alone is meaningless, its concurrent failure that is the issue. The odds of concurrent node failure is linearly proportional to TTR. I think this model needs to assume one failure at odds = 1.0, then use the odds of concurrent failure for the next 2 failures within the time window. A 'constant' chance of failure begs the question, ".001 chance of failure per what?" The first failure happens, that is assumed. Then the next two happen given odds within a time window.

          Assuming Hadoop's failure replication is optimized (which it isn't – block replication requests are dished out too slowly):
          TTR is inversely proportional to the number of racks in a group for rack failure.
          TTR is inversely proportional to the number of racks in a group for single node failure IF the combined bandwidth of the group's machines in a rack is at least 2x the between-rack bandwidth; otherwise it is inversely proportional to the ratio of rack bandwidth to node-group bandwidth.

          The result is that only the "medium" sized groups above are viable, else it takes too long to get data replicated when a failure happens. Also, TTR affects the odds of data loss disproportionately for larger replication counts.

          Hide
          rschmidt Rodrigo Schmidt added a comment -

          Hi Scott,

          I totally understand your concerns. BTW, thanks for the thorough analysis you gave. That was impressive!

          I would say that the main problem is the complexity of the calculations. It was already hard to calculate things with this simplified model; adding more variables would be harder. Besides, I think we are more interested in permanent failures – those that cannot be recovered – since we are trying to reduce the odds of permanently losing data in an unrecoverable way.

          I'm not saying we should not use TTR. I'm just saying that we didn't aim for that in our evaluation.

          I guess the best thing we wanted to take from these numbers was the comparison between the different algorithms, and whether it was worth changing the block placement policy or not.

          I think this is a great discussion JIRA and I'm really happy with all the great ideas people have been giving. Having said that, it's clear to me that there is no definite solution to the problem. Depending on how you approach it, different policies will be the optimal ones. For instance, take DEFAULT, RING, and DISJOINT. They are all great and valid approaches, each one with its ups and downs. I believe the main result of this JIRA is that people will start creating different probability models and algorithms depending on their use cases, and we will see several block placement policies coming out of it.

          Hide
          scott_carey Scott Carey added a comment -

          I think we are more interested in permanent failures - those that cannot be recovered

          That simplifies things. We can ignore rack failure, which is predominantly an availability problem not a data loss problem.

          Then the TTR issue becomes primarily about how many nodes a group has per rack. So the short answer for keeping TTR from growing too large comes down to the number of group nodes in each rack.

          R = replication count.
          Let's say that p_0 is our baseline probability that R nodes will simultaneously fail (R = 3, p_0 = 0.001 in the calculations above).

          Now let's find p, the adjusted probability taking TTR into account.

          N = machines per rack in a group.
          RB = rack bandwidth.
          NB = node bandwidth.

          if (NB * N) / RB >= (R-1), then p = p_0.
          else, p = p_0 * ((RB * (R-1)) / (NB * N)) ^ (R-1).

          Assuming p_0 = 0.001 and R = 3 this is more clearly presented as:

          IF (NB * N) >= (RB * 2)
          p = 0.001
          ELSE
          p = 0.001 * ((2* RB)/(NB * N)) ^2

          If Rack Bandwidth is 10x node bandwidth, then this is:
          IF N >= 20, then p = 0.001
          ELSE p = 0.001 * (20/N) ^ 2
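
          A small helper expressing that adjustment (a sketch under the same assumptions; the function name is mine):

          def adjusted_p(p0, n_per_rack, replicas=3, rack_bw=10.0, node_bw=1.0):
              # If the group's nodes within a rack can absorb (R-1) rack-loads of
              # re-replication traffic, TTR is unchanged; otherwise it stretches,
              # and the loss probability grows as stretch^(R-1).
              stretch = (rack_bw * (replicas - 1)) / (node_bw * n_per_rack)
              return p0 if stretch <= 1 else p0 * stretch ** (replicas - 1)

          adjusted_p(0.001, 5)    # 0.016 -> the 16x penalty applied below
          adjusted_p(0.001, 10)   # 0.004 -> the 4x penalty
          adjusted_p(0.001, 20)   # 0.001 -> no penalty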

          In the 100 rack, 2000 node example you have the below as a subsection:

          RING GROUPS (window = 10 racks, 5 machines) => 0.000352741
          DISJOINT GROUPS (window = 10 racks, 5 machines) => 0.000177175
          
          RING GROUPS (window = 10 racks, 10 machines) => 0.00151145
          DISJOINT GROUPS (window = 10 racks, 10 machines) => 0.000768873
          
          RING GROUPS (window = 10 racks, 20 machines) => 0.00550483
          DISJOINT GROUPS (window = 10 racks, 20 machines) => 0.00307498
          

          I'm not sure if RING GROUPS are better at TTR issues than DISJOINT GROUPS; I haven't thought that one through. But assuming it's the same, adjusting for TTR gives these adjusted data loss odds for the above:

          (TTR is 4x longer, so probability of loss is 16x)
          RING GROUPS (window = 10 racks, 5 machines) => 0.005643856
          DISJOINT GROUPS (window = 10 racks, 5 machines) => 0.0028348
          
          (TTR is 2x longer so probability of loss is 4x)
          RING GROUPS (window = 10 racks, 10 machines) => 0.0060458
          DISJOINT GROUPS (window = 10 racks, 10 machines) => 0.003075492
          
          RING GROUPS (window = 10 racks, 20 machines) => 0.00550483
          DISJOINT GROUPS (window = 10 racks, 20 machines) => 0.00307498
          

          Interestingly, this almost exactly compensates for the shrinking of group size once TTR is limited to the intra-rack bandwidth in the group.

          Note that p_0 itself increases with the cluster size. The more nodes, the higher the likelihood of co-occurrence of node failures.

          My conclusion is that this is useful especially for large clusters with large numbers of nodes per rack, or larger ratios of intra-rack bandwidth to inter-rack bandwidth. For clusters on the other end of those spectrums, it's hard to beat the current placement algorithm as long as replication of missing blocks is done at maximum pace.
          Of course, in the real world the Namenode does not issue block replication requests at the maximum pace. I did some tests a few days ago with three scenarios (decommission, node error, missing node at startup) and found that the bottleneck in replication is how fast the NN schedules block replication, not the network or the data nodes. It schedules block replication in batches that are too small to saturate the network or disks, and does not schedule batches aggressively enough. So one way to increase data reliability in the cluster is to work on that and thereby reduce TTR.

          Hide
          shv Konstantin Shvachko added a comment -

          > We can ignore rack failure, which is predominantly an availability problem not a data loss problem.

          We should NOT ignore rack failures. If this simplifies the probabilistic models it's fine, but in practice rack failures should be accounted for. If data is not available for hours as a result of this, which is typical, the clients will start complaining. Also this would be a degradation from the current policy.

          Hide
          stevel@apache.org Steve Loughran added a comment -

          I'm trying to understand this; it'll take me a while to get round the maths.

          1. If you keep all the data hanging off the same switch, replication costs are lower and maybe P(loss) is lower, but the cost of a network partition is higher, as the probability of any copy of the data being visible is reduced. That may or may not be something to worry about. Konstantin does, clearly; I think it may be best to detect a partition event and maybe drop into safe mode if a significant percentage of the cluster just goes away. This depends on what you want: a replication storm and possible cascade failures vs. a hard outage. Either way, the ops team gets paged.
          2. If your cluster is partitioned into independent power sources/UPSes, then again, less independence. Failure of the power source takes a big chunk of the cluster offline. This won't look significantly different to the NN/JT, unless they are hooked into events from the UPS.
          3. Batches of HDDs may suffer from the same flaws. In an ideal world, you'd know the history of every HDD and avoid having all copies of the data on the same batch of disks until you considered them bedded in. This would imply knowing about the internal disk state of every datanode, though...
          Hide
          aw Allen Wittenauer added a comment -

          With pluggable block placement, I'm not sure where we sit with this JIRA.... so.... ping!


            People

            • Assignee:
              rschmidt Rodrigo Schmidt
              Reporter:
              dhruba dhruba borthakur
            • Votes:
              0
              Watchers:
              46
