Details

Type: Improvement

Status: Open

Priority: Major

Resolution: Unresolved

Affects Version/s: None

Fix Version/s: None

Component/s: namenode

Labels: None
Description
The current HDFS implementation specifies that the first replica is local and the other two replicas are on any two random nodes on a random remote rack. This means that if any three datanodes die together, then there is a nontrivial probability of losing at least one block in the cluster. This JIRA is to discuss if there is a better algorithm that can lower probability of losing a block.

 calculate_probs.py
 2 kB
 Rodrigo Schmidt

 failure_rate.py
 11 kB
 Rodrigo Schmidt

 prob.pdf
 67 kB
 Aravind Menon

 prob.pdf
 67 kB
 Aravind Menon
Issue Links
 is blocked by

HDFS-1261 Important methods and fields of BlockPlacementPolicyDefault.java shouldn't be private
 Open
 is related to

HDFS-385 Design a pluggable interface to place replicas of blocks in HDFS
 Closed
 relates to

HBASE-5843 Improve HBase MTTR (Mean Time To Recover)
 Closed
Activity
Dhruba, doing more than one off-switch write might have a significant negative performance side effect on the write pipeline.
Shouldn't HDFS-385 be sufficient to address the current jira via a plugin for a 3-rack block placement? Or is this jira meant to be about a plugin?
This is more of a discussion on how to design a non-default policy; I do not plan to change the default policy of having only one off-switch write.
Also, option 1 has only one off-switch write.
Hey Dhruba,
How does this change the probability? Are you saying that P(rack r and rack r+1 fail) < P(rack i and rack j fail, i != j)? Can you sketch out mathematically how this decreases the probability of failure?
Brian
The issue I see with 1 (which may not be a big deal) is that if more racks are added, then the placement policy cannot utilize the new, emptier machines more efficiently without rebalancing data. Would a hybrid help? Replica 1 is chosen at random from (r1) and the next replica using scheme 2 (excluding the node on rack r1).
@Brian:
I think this change protects against node failures more than rack failures. Let's say the replication factor is three. If you calculate the probability of data loss if any 3 machines fail in the cluster at the same time, then the placement matters:
1) If the blocks are scattered across all the machines, then any 3 machines failing will contribute towards data loss.
2) If a given block (by virtue of the new replication scheme) is placed only on a subset of machines, then the probability of data loss is reduced from any 3 machines failing to 3 machines in that subset failing. Failures of machines in the distinct subsets don't contribute to data loss.
Hey Karthik,
Let me play dumb (it might not be playing after all) and try to work out the math a bit.
First, let's assume that on any given day, a node has 1/1000 chance of failing.
CURRENT SCHEME: A block is on 3 random nodes. Probability of loss is a simultaneous failure of nodes X, Y, Z. Let's assume these are independent. P(X and Y and Z) = P(X) P(Y) P(Z) = 1 in a billion.
PROPOSED SCHEME: Well, the probability is the same.
So, given a specific block, we don't change the probability it is lost.
What you seem to be calculating is the probability that three nodes go down out of N nodes:
P(nodes X, Y, and Z fail for any three distinct X, Y, Z) = 1 - P(N-3 nodes stay up) = 1 - (999/1000)^(N-3)
Sure enough, if you use a small subset (N=40 maybe), then the probability of 3 nodes failing is smaller for small subsets than the whole cluster.
However, that's not the number you want! You want the probability that any block is lost when three nodes go down. That is, P(nodes X, Y, and Z fail for any three distinct X, Y, Z and X, Y, Z have at least one distinct block) (call this P_1). Assuming that overlapping blocks, node death, and subset of nodes are all independent, you get:
P_1 = P(three nodes having at least one common block) * P(3-node death) * (# of distinct 3-node subsets)
The first number is decreasing with N, the second is constant with N, the third is increasing with N. The third is a well-known formula, while I don't have a good formula for the first value. Unless you can calculate or estimate the first, I don't think you can really say anything about decreasing the value of P_1.
I think we are incorrectly assuming the probability of data loss is proportional to the probability of 3 machines in a subset being lost, without taking into account the probability of common blocks. The probabilities get tricky, hence my asking for someone to sketch it out mathematically...
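Brian's back-of-the-envelope expression 1 - (999/1000)^(N-3) is easy to evaluate; a quick sketch (assuming the 1/1000 daily failure chance stated above, function name mine) shows how strongly it depends on N:

```python
def p_three_down(N, p_fail=0.001):
    """Brian's figure: 1 - P(N-3 given nodes stay up), with per-node
    failure probability p_fail. Not the probability of block loss."""
    return 1 - (1 - p_fail) ** (N - 3)

print(p_three_down(40))    # small subset: roughly 0.036
print(p_three_down(1000))  # whole cluster: roughly 0.63
```

As Brian notes, this quantity alone says nothing about P_1 without the common-block term.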
I think Brian is right. Dhruba's proposal doesn't change the probability of data loss given that F machines fail.
Where you place blocks will only impact the failure probability of that block, and only if nodes have different probabilities of failure per machine. Placing blocks in a circular arrangement (Dhruba's proposal) or randomly doesn't affect the expected number of missing blocks given F failures, because they are both uniform placement strategies. And if we go for a non-uniform placement strategy, we will only make the worst case worse. What I mean is that the current block placement strategy is already optimal, and we can't make the expected number of missing blocks better by changing just block placement. We can only improve it by changing the redundancy scheme that we use (e.g., adding correction codes or more replicas).
If you are wondering what is the expected number of blocks that go missing given F failures, here is the math:
Consider a rway replication scheme and the following variables....
r = number of replicas used
F = number of machines that fail concurrently
N = number of machines
B = number of blocks (without counting replication)
I use the prefix notation C(N,a) to denote the number of ways you can pick a elements out of N.
With N machines, we have C(N,r) possible replica placements for a single block. Given that F machines fail, we have C(F,r) possible placements within those F failed machines.
If the block placement strategy is uniform (i.e., it doesn't favor some specific nodes or racks more than others), the probability of a block having all its replicas within the failed machines is
C(F,r) / C(N,r) = [F * (F-1) * (F-2) * ... * (F-r+1)] / [N * (N-1) * (N-2) * ... * (N-r+1)]
The expected number of missing blocks when F machines fail should be
B * [formula above]
If B = 30M, N = 1000, F = 3, and r = 3, the expected number of missing blocks would be about 0.18.
Please someone correct if my math or rationale is wrong.
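The expected-loss formula above is short enough to check directly (a sketch; the function name is mine):

```python
from math import comb

def expected_missing_blocks(N, F, r, B):
    """B * C(F,r) / C(N,r): expected number of blocks with all r
    replicas among the F failed machines, under uniform placement."""
    return B * comb(F, r) / comb(N, r)

print(expected_missing_blocks(N=1000, F=3, r=3, B=30_000_000))  # ~0.18
```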
A funny related anecdote that I've heard third-hand. I could never trace down its authenticity, but I found it amusing:
A large physics experiment once tried to track down and classify as many errors in their simulation software as possible. After they removed the known sources of errors, they took the remaining unreproducible errors and mapped them to the worker nodes. Then, they took the list of worker nodes and mapped them to where they were in the machine room. Sure enough, all the unreproducible errors could be tracked to the top two nodes in the rack.
So, if you put all the copies at the same height on the rack, the probability of losing the files at the top of the rack is definitely higher than the probability of losing the bottom of the rack.
I think there is a slight change in the way probability should be calculated if the block placement policy enforces that certain blocks reside on a subset of machines. Nevertheless, I went with the probability of losing data as opposed to the expected number of block losses.
Scheme 1: pick any machine and put blocks there. Further, assume that F = r in your example.
P(of losing data given r failures)
= P(of losing at least 1 block)
= 1 - P(of not losing any block)
= 1 - (P(of not losing a specific block) ^ B)
= 1 - ((1 - 1/C(N,r)) ^ B)
Scheme 2: assume that you have a fixed pool of machines that you replicate blocks to. For simplicity, I am going to assume what this means is that there are K machines that contain a set of blocks and all their replicas. So there are N/K such sets of machines. Further, assuming an even distribution, there are only B/(N/K) blocks in this set of K machines.
P(of losing data given r failures)
= P(r failures being in one set of K machines) * P(of losing at least 1 block in that set)
P(r failures being in one set of K machines) = C(N/K,1)*C(K,r)/C(N,r)
P(of losing at least 1 block in that set) = 1 - ((1 - 1/C(K,r)) ^ (B/(N/K))); this follows from the fact that there are K nodes and B/(N/K) blocks.
Plugging in B = 30M, N = 1000, F = 3, r = 3, K = 60 (replicate all blocks in the previous and next rack, 20 machines per rack):
Scheme 1: P(data loss) = 1 - ((1 - 1/C(1000,3)) ^ 30M) = 0.165
Scheme 2: P(data loss) = P(r failures being in one set of K machines) * P(of losing at least 1 block in that set) = 0.0034 * 1 = 0.0034
Am I doing something wrong?
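Karthik's two schemes can be checked numerically (a sketch; variable names are mine, parameters as in the comment above):

```python
from math import comb

N, B, r, K = 1000, 30_000_000, 3, 60

# Scheme 1: replicas may land on any r of the N machines.
p1 = 1 - (1 - 1 / comb(N, r)) ** B

# Scheme 2: N/K sets of K machines; a block and all its replicas
# stay within one set, so each set holds B / (N/K) blocks.
p_set = (N / K) * comb(K, r) / comb(N, r)             # r failures in one set
p_block = 1 - (1 - 1 / comb(K, r)) ** int(B * K / N)  # a block in that set lost
p2 = p_set * p_block

print(round(p1, 3), round(p2, 4))  # 0.165 0.0034
```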
Just synced up with some folks (including Rodrigo), I think the mismatch was from the following:
The expected number of blocks lost is the same in both cases. This is because when 3 nodes die:
In scheme 1, there is a high probability data is lost, and a little of it is lost when it happens.
In scheme 2, there is a low probability data is lost, but more of it is lost when it happens.
The product of the probability and the number of blocks lost is the same. That gives us 2 choices: reduce the probability, or reduce the number of blocks lost.
The underlying issue is that any data loss is bad, be it a little or a lot (especially for some kinds of applications). So we are better off lowering the probability that any data is lost than the number of blocks lost at a time. From that perspective, this is a good change.
Yes, if we want to minimize the chances of having ANY data loss, restricting block replication to a subset of the machines is more effective.
Hey Karthik,
Why don't we take this to the extreme and set K = r? You can do this without introducing a second off-rack copy in many cases.
Then, the probability of any data loss is fixed at 1/C(N, r); for your sample case, this is about 6.0e-9, quite small.
Brian
Hey Brian,
Yes, that would help with the data loss case but has some unfortunate side effects:
1. Data availability would be reduced: we cannot access the data at all on inter-rack switch failure.
2. It would cause the data to be very unevenly distributed across the various racks.
Another workaround for this is to have many smaller clusters instead of one large cluster, but that becomes a maintenance nightmare (operationally), and the application layer above would have to balance the data across these clusters.
> P(of not losing a specific block) = 1 - 1/C(N,r)
Is that right?
I think you are throwing one replica at a time on the cluster. The probability of the first missing the failed nodes is (N-r)/N. The probability of the second falling onto a live node, excluding the one that already has the first replica, is (N-r-1)/N. And so on.
It would be good if somebody could combine calculations in a pdf (or html) doc and attach to this jira. We can link it from FAQ then.
> I think you are throwing one replica at a time on the cluster. The probability of the first missing the failed nodes is (N-r)/N. The probability of the second falling onto a live node excluding the one that already has the first replica is (N-r-1)/N.
Shouldn't it be (N-r)/N, (N-r-1)/(N-1), (N-r-2)/(N-2) and so on?
Similarly, for the probability that all replicas reside on the failed nodes, it would be
r/N * (r-1)/(N-1) * ... * 1/(N-r+1) = 1/C(N,r), which is the same as in Karthik's formula.
Yes, I forgot to decrement the total number of nodes after the first replica is placed. It should be (N-r-1)/(N-1) instead of (N-r-1)/N, which leads to Karthik's formula. I stand corrected.
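The one-replica-at-a-time product above does collapse to the closed form; a quick exact check (a sketch, using Fraction to avoid rounding):

```python
from fractions import Fraction
from math import comb, prod

def p_all_on_failed(N, r):
    """Place replicas one at a time; probability all land on the r
    failed nodes: r/N * (r-1)/(N-1) * ... * 1/(N-r+1)."""
    return prod(Fraction(r - i, N - i) for i in range(r))

# Matches 1/C(N,r) exactly, as in Karthik's formula.
assert p_all_on_failed(1000, 3) == Fraction(1, comb(1000, 3))
```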
Hi,
We did some analysis of the data loss probability in HDFS under different block placement schemes and replication factors. We consider two simple placement schemes: 1. random placement, where each data block can be placed at random on any machine, and 2. in-rack placement, where all replicas of a block are placed in the same rack. The detailed analysis of these scenarios is covered in the attached pdf.
The main observations from the analysis are:
1. Probability of data loss increases with cluster size
2. Probability of data loss is lower with in-rack placement than with random placement
3. Probability of data loss is lower with higher degree of replication
Regards,
Aravind
Fixed a minor typo in the pdf.
Aravind
From all the above mathematical formulae, can we then say that the policy listed below can be a good first step:
"The first replica is on a node on rack r. Then the other two replicas are randomly selected nodes on either rack r-1 or r+1. In this approach, three datanodes in two consecutive racks have to fail simultaneously for a block loss to occur. This is better than the current implementation, where any three datanode failures in the entire cluster can cause one block to be lost."
The NetworkTopology class in the namenode already has information about node to rack location.
For this JIRA I'm implementing a general idea, in which configuration variables determine a subset of the racks that can be used for each block and a subset of machines within that rack.
However, the NetworkTopology class from Hadoop Common lacks methods to list racks or limit machines within a rack. I think this is fundamental for this JIRA or any other complex block placement policy (NetworkTopology has a very limited API).
Even though it's possible to circumvent the NetworkTopology limitations by accessing the namenode directly (through FSNameSystem passed as FSClusterStats to BlockPlacementPolicy), this makes the code really ugly.
I would like to get the general opinion about extending NetworkTopology from hadoop common.
Quoted from prob.pdf:
We consider the probability of data loss in a cluster of n machines, containing b data blocks replicated
randomly with a replication factor of r. Let us call this probability p_loss_random(n, b, r).
It seems that the probability model described in prob.pdf is incorrect. Data loss probability P depends on time T. The longer the cluster runs, the higher the probability of data loss. The authors seem to assume P is independent of T.
There are a number of simplifications there, including in the block placement algorithms. This was done to simplify the math.
We now have better models for the algorithms, which we should post here soon. However, time-dependent failure probabilities would make calculations overly complicated. It would be great if someone could work that out.
> There are a number of simplifications there, ...
I see. You may want to include a section in the beginning to list out all the simplifications.
That might make it hard to start reading, since no context will be given to the reader.
I think Aravind tried to list the whole set of assumptions in the beginning of section 2. Maybe we just have to make it clearer.
Just curious. Are you getting missing/corrupt blocks for those with replication of 3 on your clusters?
@Koji: we have files with a replication factor of 3. If a large number of datanodes fail at the same time, we do see missing blocks. Sometimes the datanode process on these machines fails to start even after repeated start-dfs.sh attempts; sometimes the entire machine fails to reboot. Then we have to manually fix a few of those bad datanode machines and bring them online; this fixes the "missing blocks" problem but is a manual process and is painful.
This policy seems to decrease the aggregate read bandwidth of a file.
@Hairong: I'm not sure if I understand your point. Can you explain it a little more?
Suppose a file has N blocks. The proposed policy places all N blocks in at most 3 racks, while the current default policy could place them in up to N+1 racks, which represent a bigger aggregate read bandwidth when N>2.
Hi Hairong, does your comment about aggregate read bandwidth refer to the use case when there are multiple simultaneous readers of the same file?
@Hairong: Just a few observations
1) The policy I'm implementing doesn't fix the number of racks to 3, and I don't think anyone agrees that should be the case. Dhruba mentioned this static 3-rack policy in this discussion, but I guess that was just an example of how block placement can reduce the probability of failure.
2) There is a clear tradeoff here. The more you restrict where you can place blocks, the worse your maximum bandwidth and load balance will tend to be. (because you will have fewer ways to rearrange blocks to maximize these metrics). So, no free lunch.
3) Still, in some cases, we might want to reduce bandwidth and load balance possibilities for a lower probability of data loss, especially if HDFS is being used to store important data, backed up by some middleware that has a cache layer.
Having said that, the patch I'm working on is a configurable policy where configuration variables define windows in two spaces, racks and machines within a rack, in which blocks can be placed. Potentially, these windows can be set to be the entire cluster, which is exactly the default block placement policy. At the other extreme, such windows can be defined to be as restrictive as only three machines, which will give the lowest probability of data loss but no freedom to improve load balance or bandwidth.
Please let me know if this is not clear enough.
Right, in the case of a map/reduce application, multiple mappers are reading the same file simultaneously. Placing the blocks of a file in multiple racks greatly increases the probability of reading from a rack-local node.
Rodrigo, thanks for your explanation. Now I understand your proposal much better.
> We assume machine failures are independent.
I am not sure if I agree with this assumption. From the attached analysis, the in-rack placement has the lowest data loss probability. This is counterintuitive. In reality, the chance of losing a rack is not small. So a block placement policy normally places a block in at least two racks.
The assumption that failures are independent was made to ease the calculations. Assuming that failures are dependent would add too much complexity to the problem.
In the implementation, we are still trying to keep blocks on separate racks and that's why I said there will be two windows: one for the remote rack to be chosen, and the other to limit the machines that can be chosen inside that rack (for each block that must be replicated).
but mapreduce programs rarely read from just one file. as long as the total input data spans most racks in a cluster, that should be good enough.
data population in most clusters (like in our case) is itself via parallel processes (whether it's crawlers or distcp-like parallel copiers). these in turn are consumed by other mapreduce jobs and so on. based on this, i would imagine that most input data sets have a number of files populated via different nodes/writers (and hence likely to span, in totality, a large number of racks in the cluster).
More questions...
I understand that the proposed policy tries to reduce the data loss of a file by placing its blocks on N datanodes, where N is equal to or less than the total number of datanodes in the HDFS cluster. Why does the policy need to limit the number of racks to choose from? It seems to me that the more racks, the better.
How do you plan to handle the balancer?
The higher the number of racks, the higher the latency for the write pipeline. Otherwise, the default block placement policy would always use 3 different racks instead of always having 2 replicas on the same rack.
I've had some general discussions with Dhruba about the balancer. It doesn't use the block placement policy, so I'll have to extend it separately.
I did not mean # of racks per block. I meant # of racks per file.
As Joydeep wrote, we didn't think this was a major problem. What is your proposal to fix that?
For a large file, it does matter, especially in the use case of compacting a large number of small files (like reduce results) into one by concatenating or archiving.
Anyway, whether it matters or not, my question is why you want to have this rack limitation.
I don't think I understand your use case. It only seems to be advantageous to do what you say if there are multiple readers for the same file.
We designed it this way because it would be relatively easy to understand, implement, generalize, and plan for (as users).
But I'm quite open to options. What would you propose instead?
The math looks good to me (in pdf file).
> Data loss probability P depends on time T.
Here the assumption is, correct me if it's wrong, that f nodes fail simultaneously. Otherwise, we should take into account the replication process, which will be restoring some blocks while other nodes are still up, decreasing the probability of data loss. The probability of losing f nodes simultaneously at a particular moment does not depend on time. The probability of a simultaneous failure of f nodes during a specific period of time depends on the length of the period. So if you choose the parameter p in the document correctly (depending on the time period), then you get the probability of a data loss during this period of time.
The assumption p = 0.01 or 0.001 seems arbitrary, but it probably does not matter as you compare different strategies with the same value.
What is missing in the analysis is that the probability of losing a whole rack is much higher than the probability of losing any 20 machines in the cluster. It should actually be equivalent to the probability of losing one machine, because you lose one switch and the whole rack is out.
And that was one of the main reasons why we decided to replicate off rack.
Rodrigo, did I understand correctly that your idea is to experiment with replication within the rack, that is, all replicas are placed on different machines in the same rack?
Konstantin: Re: <<< because you lose one switch and the whole rack is out. >>> The calculations were primarily to reason about data loss (multiple disk failures that could potentially lead to all replicas of a block getting lost) and strategies to deal with those, rather than temporary data unavailability.
@Konstantin: The new policy will have, for every block, a limited window of racks you can choose from, and a limited window of machines within such racks. For every block, we will keep the idea of having a copy that is local to the writer, and two copies at a remote rack, but always respecting this limited window of choices.
> I don't think I understand your use case. It only seem to be advantageous to do what you say if there are multiple readers for the same file. We designed it this way because it would be relatively easy to understand, implement, generalize, and plan for (as users).
You still do not get my point. I think the goal of this policy is to reduce the data loss by limiting the # of nodes to place a file's data. Is this additional limitation of racks necessary? Or are you saying this is just easier for you to implement? I do not see how this helps users understand or plan. In general, having fewer configuration parameters is easier for users to understand or plan around.
I think I get your point. I don't think you get mine, though. I just want to know what else you have in mind.
Let me put it this way: If you were to implement this, what would you do?
Just complementing my previous comment, I don't know of a better way to implement this that wouldn't be either overly complicated, or hard to configure and understand. I'm open to other ideas, but you have to give me some.
Hi Rodrigo, I do not know your algorithm so I have no idea how relaxing the rack restriction would complicate your implementation.
Assume that a user wants to place a file's blocks on at most N datanodes and a cluster has R racks. If you place blocks on at most N/R datanodes per rack, is that a special case of your proposal? Of course, there are other algorithms too...
Hairong, would you mind explaining how this approach would improve the probability of data loss? I don't quite see it.
> Here the assumption is, correct me if it's wrong, that f nodes fail simultaneously. Otherwise, we should
> take into account replication process, ...
I totally agree that we should take the replication process into account. That's also the reason that we cannot ignore time T in the equations. A model that only considers data loss (i.e., disk failure) probability P without time T is, in my opinion, not useful, because the probability of simultaneous multiple disk failures in the real world is really low, probably negligible. I think it would happen only in a site-wide disaster situation.
Rodrigo, have you observed any case of simultaneous multiple disk failure?
@Rodrigo: how about using a term to describe the groups of nodes that we are dividing the cluster into? (I will call it a 'nodegroup' for now.)
@Hairong: if we drop the rack restriction, then it seems that each nodegroup will have to have at least 2 x (# of racks) nodes. This is because we want 2 nodes per rack for each nodegroup (for 2 rack-local copies for the writer). So we are losing control of the size of the nodegroup. I think the N racks x M nodes definition is much more flexible: it gives control of how much rack fault tolerance we want (by controlling the parameter N) vs. how much fault isolation/re-replication bandwidth we want (by controlling the overall product N x M).
One other thing: i think contiguous racks have a higher chance of correlated failures (because of shared physical location/temperature etc., as well as the fact that in a heterogeneous cluster, new hardware may be added in contiguous racks). So the definition of N is better set to be alternating (or something of that sort).
One problem not discussed here (I think) is what happens when more racks are added (and the nodegroup definitions suddenly change; this is particularly problematic if some modulo arithmetic is used to choose racks for a nodegroup). Does the system re-replicate things to conform to new nodegroup definitions? (That would be awful.) If not, how does it know not to?
regarding T and simultaneous failures:
- the smaller the nodegroup, the larger the recovery time (since the time to re-replicate data from a bad disk increases, as only a smaller set of nodes have copies of the bad data). so there's a tension here between fault isolation and recovery time. however, it would seem that there are declining returns to re-replication bandwidth as the size of a nodegroup is increased.
it would be useful to know what this curve looks like (time to recovery versus size of nodegroup) to choose a good cutoff. this may also force one to choose a larger number of racks (N) for a nodegroup. I would be much more worried about re-replication bandwidth than IO bandwidth for mapreduce.
- i think we have seen large numbers of failures due to bad hardware (faulty set of nodes) or operator error (some set of nodes reimaged). (which is why i think the nodegroup should not be contiguous in definition)
there are also events (cluster-wide restart/reboot causing several nodes to not boot because of a bad boot disk/sector) that cause simultaneous failures.
as the number of nodes has increased, what i have heard anecdotally is that at any given point in time, a significant number of nodes are down, and sometimes this causes blocks to go missing. so in that sense, i guess yeah, we do see simultaneous failures.
there are other strategies to decrease T (and thereby decrease overall failure prob.). if re-replication bandwidth is the constraining factor, a separate network (from mapreduce traffic) for re-replication/balancing may be the right solution. however, i don't know if that is the case (in the case of these simultaneous failures).
To me, the biggest causes of data risk seem to be parts of the system that don't adhere to good placement policy, not the default write placement policy. Furthermore, this plan will decrease the probability of individual data loss but increase the amount of data lost when it does happen. Lastly, most clusters these days have 3 to 8 hard drives per node. This policy is blind to that, and it is a big factor in the sort of data loss of concern here (a disk dead for good, not a temporary unavailability). It would seem that if you are trying to force a file to exist on only 3 nodes, you would want to go all the way down to 3 disks on 3 different machines. In fact, a policy that achieves some of the intended effect would leave the node placement exactly as it is now but request that a file being written go to a particular disk in a node (a large random number mod the number of disks in the node). If there were 4 disks per node, this would decrease the likelihood of individual file data loss by restricting any given file's placement to 1/4 of the disks, but the performance would be OK because the file is still distributed uniformly by node.
Components that make block placement choices:
Balancer: The balancer must be rack-aware and requires a separate placement policy tradeoff (3 blocks on 3 different racks is a good thing there).
SetRep: Replication changes to files, and especially lowering of them, cause lots of placement inefficiencies. For example, if you have a large file, or a collection of files whose block count is much larger than the number of nodes, and you decrease the replication (say from 5 to 4), then the end result is that the nodes that had the highest disk usage at the time will now have ZERO blocks from those files.
Node death / decommission; recommission and new nodes: When a node temporarily disappears and its missing blocks are replicated elsewhere, placement choices are made. When it comes back and there are excess replicas, deletion choices are made. Both these choices need to be aware of not just the best placement for data loss reasons, but also the best placement for performance reasons.
Ideally, a given file, directory, or any randomly chosen subset of blocks in the system will be distributed randomly over the cluster.
What actually happens in a growing cluster, with new machines being provisioned, replication factors changing in some areas, and nodes or racks disappearing and coming back, is that the actual block placement depends on the history of the cluster. Files written long ago may only be on subsets of machines. Blocks involved in a prior temporary node failure may be inefficiently placed. Lots of blocks may have all their replicas in one rack due to this.
I think if the goal is to reduce the chance of data loss, looking holistically at all the components in the system that make such decisions is going to yield better results than looking only at the file write process. Initial block placement can have some variations, but the biggest challenge is in maintaining good block placement IMO.
@Joydeep: Using a term to describe the groups is a great idea. Just bear in mind that we are not dividing the cluster into exclusive groups. Each block has a limited set of nodes on which it can be replicated, but separate blocks might have different sets of nodes associated with them, and the intersection between these sets might not be empty.
Here is a sketch of the algorithm to make things a little more clear. This is just a simplification of the algorithm, and I am not describing several corner cases and reconfiguration scenarios.
Configuration parameters:
- R: rack window, distance from the initial rack within which we are allowed to place replicas
- M: machine window, size of the machine window within a rack
Whenever network topology changes:
- Sort racks into a logical ring, based on rack name
- Sort nodes within each rack into logical rings, based on node names
For the first replica:
- Write to the local machine, if possible, or pick a random one
For the second replica:
- Let r be the rack in which the first replica was placed
- Let i be the index of the machine in r that keeps the first replica
- Pick a random rack r2 that is within R racks from r
- Pick a random machine m2 in r2 that is within the window [i, (i+M-1) % racksize]
- Place the replica on m2
For the third replica:
- Given the steps above, pick another random machine m3 in r2 within the same window used for m2
- Make sure m2 != m3
- Place the replica on m3
I hope this explanation helps solve the confusion.
one can't have a different nodegroup for each block/file. that would defeat the whole point. (in fact, every block today is in a 3-node nodegroup, and there are gazillions of such node groups that overlap).
the reduction in data loss probability comes out of the fact that the odds of 3 nodes falling into the same nodegroup is small. (if they don't fall into the same nodegroup  there's no data loss).
if the number of node groups is very large (because of overlaps), then the probability of 3 failing nodes falling into the same node group will start going up (just because there are more nodegroups to choose from). the more exclusive the nodegroups are, the better. that means the number of nodegroups is minimized wrt. a constant number of nodes. as i mentioned, the size of the nodegroup is dictated to some extent by re-replication bandwidth. one wants very small node groups, but that doesn't work because there's not enough re-replication bandwidth (a familiar problem in RAID).
if u take some standard cluster (say 8 racks x 40 nodes), how many distinct node groups would ur algorithm end up with?
I'm not sure I understand what you mean by node group. Sometimes it seems to be the machines in which you can have replicas for the same block (always triples), and sometimes it seems to be the bigger set from which you can pick replicas from. Which one is the right one? I guess a more formal definition could break the redundancy.
Rodrigo, I understand now that you place replicas on a limited group of racks. Sounds like Ceph's placement groups.
But the document on probabilities does not address this case, does it? How do you know the probability of data loss is going to be less? How much?
my mental model is that a node group is a fixed (at any given time) of nodes. together the node groups cover the entire cluster. it would be nice to have the nodegroups to be exclusive  but it's not necessary. the reduction in data loss probability would largely come from the small size of each node group (combinatorically  the number of ways a small set of failures can all happen within one node group is much smaller compared to the total number of ways those same number of failures can happen).
the assumption is that the writer would choose one node group (for a block it's writing). if node groups are exclusive and we want to have one copy locally, this means that the node group is fixed per node (all writers on that node would choose the nodegroup to which the node belongs). that is why it seems that an entire file would always belong to the same nodegroup (as Hairong has argued).
node groups cannot be too small, because (as i mentioned) we would limit re-replication bandwidth (and therefore start increasing time to recovery for a single bad disk/node). we really need a plot of time to recovery vs. nodegroup size to come up with a safe size for the node group. having exclusive groups is nice because it minimizes the number of node groups for a given node group size.
>  Pick random rack r2 that is within R racks from r
>  Pick random machine m2 in r2 that is within window [i, (i+M-1)%racksize]
a few points:
 dangerous to choose physically contiguous racks for node groups (because of correlated failures in consecutive racks). may make things a lot worse.
 if rack numbering is based on some arithmetic (so that logically contiguous is not physically contiguous), then one has to reason about what happens when a new rack is added (i think it's ok to leave existing replicated data as-is, but it's worth talking about this case: what would the rebalancer do?)
 easy to reduce the overlap between nodegroups (and thereby decrease loss probability):
 instead of [i, (i+M-1)%racksize], choose [(i / (r/M))*r/M, (i / (r/M))*r/M + M-1] // fixed-offset groups of M nodes each in a rack
i glanced through the Ceph algorithm (http://www.ssrc.ucsc.edu/Papers/weilsc06.pdf): it doesn't try to do what's described here. the number of placement groups is not controlled (and that is not an objective of the algo.).
a close analog of the problem here is seen in RAID arrays. When choosing to do parity + mirroring to tolerate multiple disk failures, one can choose to mirror RAID-parity groups or apply parity over RAID mirrored groups (1+5 vs. 5+1). turns out 5+1 is a lot better from a data loss probability perspective. the reasoning and math are similar (both are susceptible to data loss on 4-disk failures, but in 5+1 the 4-disk failures have to be contained within 2 2-disk mirrored pairs. this is combinatorially much harder than the 1+5 case, where the 4 disk failures have to cause 2 failures each in the N/2 node groups).
@Konstantin: We have a script that calculates the probabilities. The author is on vacation, so I can't just upload it here without his consent (will do that as soon as he agrees). I will post a table with values as soon as I get some time. Right now I'm trying to finish up the first version of the code and test it.
@Joydeep:
> a node group is a fixed (at any given time) of nodes
??? I think you missed the most important part of the sentence
I tried to get the meaning from the rest of your text but some things still look weird. You seem to convey that node groups might be bigger than the number of replicas you want to create. If so, how do you choose the replicas within the node group? In the algorithm I'm implementing, it's not the case that any choice of 3 nodes is valid within what I think you are calling a node group.
> instead of [i, (i+M-1)%racksize], choose [(i / (r/M))*r/M, (i / (r/M))*r/M + M-1] // fixed-offset groups of M nodes each in a rack
And how do you cope with racks of different sizes in this case?
 of course i meant a node group = fixed set of nodes.
and of course, replication is done within any three nodes of a node group (and the node group is much larger than 3, of course). i understand there's a subtext to this: we do want the replicas to span racks, the writer writes one locally, and two go within one remote rack. that's all good and consistent though. as long as the node group spans multiple racks, we are good. i don't see how this is all that different from the protocol u listed, where u are choosing nodes randomly subject to some constraints.
the main difference is that i am suggesting minimizing the number of node groups. ur protocol implicitly has a finite number of node groups, but i am trying to argue that it's much much more than it needs to be. the math is pretty obvious (though tedious). the reduction in loss probabilities comes from the small node group, but it is offset by the larger number of node groups. if one doesn't minimize the number of node groups, we start losing out on the benefits of this scheme.
good point about racks with different number of nodes. let me think about it.
I might be wrong, but it seems to me that you are oversimplifying things and the notion of node groups as you present is not sound.
However, I can only confirm that with a clear definition of what a 'node group' is. When you say 'fixed set of nodes', it could be all the nodes, it could be 2-node sets, it could be a partial set, it could be pretty much any subset. What characterizes valid node groups? How many of them are there? What is their size? I can't discuss in terms of something that is not well defined.
What I understand is the restriction you made in the selection of nodes for the algorithm I described. Kannan, Karthik, Aravind, Dhruba, I, and others discussed that several times when we first started wondering about data loss probabilities in HDFS. From what I can remember, the main argument against it was that it doesn't generalize gracefully when racks have different sizes, and it was not clear how much better it was compared to the current approach, or whether the current approach could simply be configured to have the same data loss probability by reducing the size of the windows (note that the current approach is configurable, and we can pretty much decide what data loss probability we will have based on the cluster configuration and the window sizes). If the others are following this discussion, they might want to chime in and say a few more words about what else made us give up on this idea before.
(this is the closest i will get to writing a book. dedicated to u Rodrigo)
what else is there to say? i have said it all:
 node group is fixed (at a given point in time) set of nodes
 all the node groups together span the entire cluster
 a given block is replicated only within member nodes of a single node group
this is the definition. there are no 'invalid' node groups. but nodegroups have certain properties that we know to be highly desirable:
 node groups should ideally, but not necessarily, be disjoint. the probability of data loss grows with the number of node groups (at a given group size) and grows even faster with the size of each group, so both should be kept small. overlapping node groups cause an excessive number of node groups for the same node group size and hence are not desirable.
 should all node groups be of equal size?: ideally yes. this will reduce the loss probability to a minimum (holding other factors constant)
 all node groups should span multiple racks: this is because we know that racks are a unit of failure. so to accommodate a replication policy that spans racks, we need the node group to span racks. how many racks it should span is ideally configurable.
 similarly, we know that hdfs wants the writer to make one node-local and one rack-local copy. that pretty much means that each node group has at least 2 nodes for each rack it spans. if it's chosen as
 because replication does not cross a nodegroup boundary, recovery from disk/node failure is constrained by the re-replication bandwidth available inside a node group, which is proportional to the size of the nodegroup. if the nodegroup is too small, then the time to recovery (from a bad disk/node) goes up. so one cannot have an extremely small node group (like a 2-node group).
i cannot suggest the optimal size of a nodegroup without more data on this front. if one increases MTTR, then the odds of data loss go up. others have said the same thing and this is data storage 101. so unless someone plots the time to recovery for a single node failure (for a reasonably sized node) vs. the size of a nodegroup, it's hard to say what's high enough.
if re-replication bandwidth were infinite, then one could have really small nodegroups. this is what was done in the old days: all disks were mirrored in pairs, and disks were small enough that a mirrored disk pair had a very small MTTR. of course, disks have grown 1000-fold in size since then and their bandwidth hasn't increased much, so this strategy doesn't work anymore.
 it would be foolhardy to define node groups in a way that puts correlated units of failure in the same nodegroup. this is pretty obvious, which is why choosing contiguous racks for a node group may not be a good approach at all (given common rack-based rollout strategies for new hardware, as well as likely correlation in temperature)
i would love to learn about how i am oversimplifying things. it's really simple combinatorics for me:
 when considering simultaneous failures, the data loss probability will be dominated by what happens when (number of failed nodes) = (degree of replication) = 3, say.
 if one considers the 3-node failure scenario: to cause data loss, all 3 failed nodes must fall in the same nodegroup. the number of ways one can do this is:
a. directly proportional to the number of node groups (obviously)
b. very strongly dependent on the size of the node group (growing roughly as its cube, via C(size, 3))
as a simple example, if i were to divide 100 nodes into:
- 10 groups of 10 each => 10 * C(10,3) = 1200
- 5 groups of 20 each => 5 * C(20,3) = 5700
as u can see, halving the group size leads to roughly a 5-fold reduction (5700/1200 ~= 4.75) in the number of ways of hitting failure, and hence in the odds of failure.
 even if u consider higher numbers of failures, similar combinatorics still apply.
i hope the arguments are crystal clear now: all else being equal, it's a bad idea to have overlapping nodegroups. we would be holding factor (b) constant but raising factor (a).
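The arithmetic in the 100-node example above can be checked directly (the helper name is mine, for illustration):

```python
from math import comb

# Number of 3-node failure combinations that all land inside one node group,
# for n_nodes split into equal, disjoint groups of group_size.
def loss_combinations(n_nodes, group_size):
    n_groups = n_nodes // group_size
    return n_groups * comb(group_size, 3)

print(loss_combinations(100, 10))  # 10 * C(10,3) = 1200
print(loss_combinations(100, 20))  # 5 * C(20,3) = 5700
```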
beyond this, as i said, i don't have a precise layout strategy right now. the naivest approach is what u said u may have already discussed. we don't have heterogeneous clusters. even if we did, we may have (at worst) a mix of 1U and 2U nodes, and one could just consider a 2U node to be = 2 x 1U. so in the naive strategy, i would:
 treat the nodes and racks as a 2D grid.
 random hash the rackids into a different id space for purposes of labelling on this 2D grid (logically contiguous is not physically contiguous). one could do that with nodes as well.
 divide the grid into NxM sized disjoint nodegroups (a rectangular subgrid) where N = racks and M = nodes. Admin decides
 NxM based on rereplication bandwidth
 N based on desired rack fault tolerance
 N >=2 (3 at least i imagine). M >= 2
 on rack addition, in the rackid space the new rack will likely be inserted in the middle of existing racks. in the naive strategy:
 only newly written blocks would be written as per the revised nodegroup definitions
 the system would have more than the minimum possible number of nodegroups (it would have the new groups as well as the old groups). but that's ok.
i think smarter strategies would differ primarily in how elegantly they handle rack insertions. the other way to handle this is to have node groups not be hard-wired based on topology, but just have software generate node groups and their members. it's much easier to adjust for reconfigurations: one can easily come up with protocols to remap nodegroups to make full use of a new rack while minimizing changes to existing groups. this sort of control data is rarely mutated and no one dies if everyone's not in sync, so it's an easy thing to share in a cluster.
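A minimal sketch of the naive grid strategy, under my own assumptions about the inputs (a dict of rack -> node list; `node_groups` and the md5-based rack shuffling are illustrative, not a proposed implementation):

```python
import hashlib

def node_groups(racks, N=2, M=5):
    """Cut the cluster into disjoint N-rack x M-node rectangles."""
    # Deterministic pseudo-random rack ordering, so logically contiguous
    # racks are not physically contiguous.
    order = sorted(racks, key=lambda r: hashlib.md5(r.encode()).hexdigest())
    groups = []
    for gi in range(0, len(order), N):            # N racks per group
        rack_slice = order[gi:gi + N]
        width = min(len(racks[r]) for r in rack_slice)
        for ni in range(0, width, M):             # M nodes per rack per group
            groups.append([n for r in rack_slice for n in racks[r][ni:ni + M]])
    return groups
```

For the 8-rack x 40-node cluster asked about earlier, this yields (8/2) * (40/5) = 32 disjoint groups of 10 nodes each.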
According to your definition, what prevents me from saying that there is only one node group, which is the set of all nodes in the cluster? Or why not say that all triples are node groups? Or any combination of k>3 nodes?
What makes me think this definition is not sound is the fact that it doesn't link node groups to replication policies. For the rationale to be sound, there should be something saying which node groups a policy generates.
Your examples seem to suggest maximal sets from which any triple is a valid selection for replication. If that's the case, you are oversimplifying things. The restriction that one replica is on one rack and two replicas are on another prevents several triples in both approaches, and this has a non-negligible impact on the math.
Your argument about disjoint and non-disjoint groups is valid only if you consider groups of equal size. A strategy with big non-disjoint groups can be as good as or better than one with disjoint groups if you reduce the size of the groups it defines.
@Rodrigo: it's evident u have not read my post beyond the first three lines. please read through in detail. if ur questions are not answered (ur comments are all categorically wrong; i have said none of the things u say i have said, and quite the contrary), please come and talk.
I had read it before. I read it again now and I still have the same questions. I guess I'm probably missing something.
Anyway, the whole discussion about the node group definition might not be essential. We can probably take it offline.
I understand your proposal about having disjoint sets of nodes, and I think it is a valid one. We decided to do it differently because the current algorithm seemed to generalize better and could get to better probabilities by reducing window sizes.
The "node group" versus Rodrigo's definition:
Node groups – yes, overlap is bad. But, no overlap is also bad. Consider:
If there is no overlap, and the policy is to write a 3-replica block to two racks, one with two replicas and another with one, then the minimal group size is four – you must have two racks and at least two nodes per rack. A larger group must always have two nodes per rack. With N racks in the node group, a rack failure causes 1/N of the blocks to be reduced to 1 replica.
It is possible to define overlapping groups with a better worstcase scenario here, with the tradeoff being increased risk due to overlap.
However, this whole enterprise needs to be built on a better foundation.
some problems
Disks versus nodes:
Start considering disk failure, not just node failure. Newer Hadoop versions don't fail the node when one disk in the node fails. By placing replicas by disk rather than by node, there are several strategies to reduce data loss without impacting time to replicate. This is because the dominant factor in time to replicate is network bandwidth, which can only be improved for replication if node groups or the placement policy span more racks and nodes.
What we should really be talking about here is "disk groups" and "disk placement policy" not "node groups" and "node placement policy".
Disk is to Node as Node is to Rack.
Limiting the combinatorics at the rack level is the most dangerous and increases time to recovery the most.
Limiting the combinatorics at the node level is moderately dangerous because it increases time to recovery somewhat.
Limiting it at the disk level is a 'free' win, since it does not increase the time to recovery much at all – that is limited mostly by network bandwidth.
Consider this – in a 400 node cluster each with 4 disks, there are C(400,3) ways for node failure to occur.
But, there are C(1600,3) ways for disk failure to occur. If you restrict the way replicas are placed not by node or rack, but simply by disk (say, take the block # mod the # of disks in a node to choose which disk), then this reduces the number of disk failure combinations that can cause data loss from C(1600,3) to 4*C(400,3). This is a factor of 16 improvement ((# of disks per node)^2) without any loss in TTR as long as disk speed >= network speed.
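The factor-of-16 claim checks out numerically (a quick sanity check of the argument above, not taken from the attached scripts):

```python
from math import comb

nodes, disks_per_node = 400, 4
total_disks = nodes * disks_per_node            # 1600 disks
unrestricted = comb(total_disks, 3)             # any 3 disks can share a block
restricted = disks_per_node * comb(nodes, 3)    # 3 disks in one "disk group"
print(unrestricted / restricted)                # ~16, i.e. disks_per_node ** 2
```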
Time to recovery. Since the odds of 3 simultaneous failures depend highly on TTR, this is a huge factor.
Here is a simple example of how time to recovery is impacted. The time to recover is roughly equal to:
The maximum amount of data that a single 'survivor' node has that must be replicated. If we simplify this to situations where replication has dropped to 1, this is easy to analyze: you must get the data off that worst-case node.
There is a similar analysis that has to be done for a rack as well – 'sole survivor' blocks in a rack have to be moved off rack or data loss will be a problem.
For a node, the limit is the network, usually 1 Gbit/sec or ~100 MB/sec. To move 1TB of data takes ~10,000 seconds, or about 2 hours and 47 minutes! Brand new Hadoop hardware tends to have 8TB on a node these days (though network bonding to 2 Gbps is more common).
Bandwidth between racks tends to be between 1/16 and 1/4 of the aggregate bandwidth within a rack. So worst case time to replicate tends to be longer than for a node if the goal is to make sure all lost data exists on at least two racks.
So how does that affect things?
Let's assume a node has a 1 in 10,000 chance of failure in a given hour (there are 8,760 hours in a year). Let's assume only 3 nodes for now.
If a node fails, then the odds of the other two failing are (1/10000)^2, one in 100,000,000.
If the time to recover is 3 hours instead of 1, the odds are (3/10000)^2, one in 11,111,111. Increasing the time to recover by a factor of 3 increases the chance of data loss by a factor of 9. If this was a 4replica situation, it would be an even stronger relation.
The relationship between TTR and chance of data loss is (TTR)^(R-1), where TTR is the time to recover and R is the number of replicas.
For node failure, we consider the number of replicas = 3. For rack failure, it is 2 (blocks are only on 2 racks). So node failure is more sensitive to TTR than rack failure, but the latter is obviously more dangerous.
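The relationship can be written out as a tiny model (my formulation of the argument above: the first failure is taken as given, and each remaining replica must fail within the recovery window):

```python
def relative_loss_odds(p_fail_per_hour, ttr_hours, replicas=3):
    # Chance that the remaining (replicas - 1) copies all fail during the
    # recovery window; scales as TTR ** (R - 1).
    return (p_fail_per_hour * ttr_hours) ** (replicas - 1)

fast = relative_loss_odds(1e-4, 1)   # (1/10000)**2, one in 100,000,000
slow = relative_loss_odds(1e-4, 3)   # (3/10000)**2, one in ~11,111,111
print(slow / fast)                   # 9.0: tripling TTR -> 9x the loss chance
```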
"node groups" or "restricted placement" and TTR
If the replication algorithm prioritizes the lowest replication count blocks first, then the TTR is proportional to the size of the 'sole survivor' data on the node with the most such data.
M racks, N nodes per rack, 4TB data per node, 1GBps node bandwidth, 10Gbps rack bandwidth.
Analysis – current (random) policy: write local, then choose random rack and node for second replica, and for third replica choose random node in same rack as second node.
 Rack Failure: When a rack fails, it leaves behind many 'sole survivor' blocks. In general, 2/3 of its blocks have only one replica outside the rack and require inter-rack transfer, so 2N * 4TB / 3 of data has to be transferred. The other 1/3 can be re-replicated in-rack, which is not the limiting factor here.
These replicas are evenly spread over the remaining M1 racks, and copying immediately starts – replicating blocks to another rack and then to another node in that other rack. The rack bandwidth is the limiting factor:
(2/3)(N * 4TB) / (M-1) is transferred off of each rack to another. At 10 Gbps rack bandwidth this is ~1000 MB/sec. If we assume 40 servers in a rack and 10 racks, this is ~11.8TB at 1000 MB/sec, or ~11,800 seconds ~= 3.3 hours.
 Node Failure: A single node, provided a perfect placement policy by the NN, balancer, and writers, should not hold two replicas of one block. Its TTR to get the replica count back to the pre-failure state is as follows:
All 4TB of blocks have to move, but these are spread out in many racks. 1/3 of the node's data was written by it, and those replicas are all offrack in pairs. Only one of these pairs however has to be considered to generate a copy. However, in order to maintain good block placement this replication has to go off rack. So all replication goes between racks for 1/3 the data. 2/3 of the data is 'paired' data where one other block is already inrack and the other is in another rack. This data can be replicated within rack. If scheduled optimally, 2/3 the data is replicated in rack evenly, and 1/3 is replicated cross rack evenly.
In-rack replication is limited by node bandwidth, and the 2/3 of the data that needs this sort of replication is spread across the remaining nodes – it is not spread evenly (there is more in the rack that had the node fail) but will be anywhere on the cluster. So this is 2/3 * 4TB / (N*M - 1) = ~6.7GB per node @ 1 Gbps = 67 seconds.
The 1/3 of the data that must flow between racks takes more – it is evenly distributed across nodes on other racks, so each rack must transfer 1/3 * 4TB / M data off it, in this example 133GB at 10Gbps = 133 seconds.
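Those back-of-the-envelope numbers can be reproduced from the stated parameters (M=10 racks, N=40 nodes/rack, 4 TB/node, ~100 MB/s node links, ~1000 MB/s rack links; the function names and round units are mine):

```python
TB_MB = 1_000_000  # 1 TB in MB, keeping the round numbers used above

def rack_failure_ttr_s(M=10, N=40, data_tb=4, rack_mb_s=1000):
    # 2/3 of the dead rack's data needs cross-rack copies, spread over M-1 racks
    per_rack_mb = (2 / 3) * N * data_tb * TB_MB / (M - 1)
    return per_rack_mb / rack_mb_s

def node_failure_ttr_s(M=10, N=40, data_tb=4, node_mb_s=100, rack_mb_s=1000):
    in_rack = (2 / 3) * data_tb * TB_MB / (N * M - 1) / node_mb_s   # ~67 s
    cross_rack = (1 / 3) * data_tb * TB_MB / M / rack_mb_s          # ~133 s
    return max(in_rack, cross_rack)

print(rack_failure_ttr_s() / 3600)   # ~3.3 hours
print(node_failure_ttr_s())          # ~133 seconds
```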
Now, what about a policy that created the minimum 4-node 'node group', two nodes in one rack and two in another? If a rack dies, the calculation is similar to the above. Assuming node groups pair with random other racks with no limitations, the 'survivor' parts of the group must choose new partner node pairs and replicate to them, off-rack. 2/3 of the blocks will have only one replica, 1/3 will have two. None of it is off-rack, so all of it must be transferred. This means the time to replicate is 3/2 of the policy above, about 4.9 hours. This directly translates into a 50% increased chance of data loss or availability loss from rack failure.
When a node dies out of the 'set' of four, all of the blocks it wrote are on the two nodes in the other rack (1/3 of its data). 2/3 of its data was shared with its 'peer' in the same rack, which now must replicate all of that data to a new peer in the same rack or elsewhere. Let's give it the benefit of the doubt and assume it can choose a peer in the same rack. It is limited by this replication – 2/3 of 4TB from one node to another. This takes 7.4 hours. The nodes in the other rack must also write to this peer to bring their replication from 2 to 3, which increases this time even further because the peer can only write so fast. Those blocks can be replicated to the other node instead, but then these will be very unbalanced. The absolute minimum time is the 7.4 hours above, and the 'balanced' time is 1.5x that.
factoring in TTR
So, the odds that a failure of 3 nodes, with replication = 3, loses a block's data on a 400-node cluster with uniform block placement are one in C(400,3) = one in 10.6 million.
With 100 groups of 4 nodes, the likelihood is the same for a given block. Across all blocks, however, it is reduced, with the trade-off that when a loss does occur, more blocks are lost at once.
There were C(400,3) different combinations that would lead to data loss, and now there are only 100 * C(4,3) = 400 that will.
This is a reduction in data loss probability of C(400,3)/(100*C(4,3)), a factor of ~26,000.
However, the time to replicate is now not 133 seconds but 26,667 to 40,000 seconds, 200 to 300 times longer. This increases the likelihood of such a 3-node failure dramatically. At best this results in an increase in the chance of such a condition by a factor of 40,000 (200^2). And for blocks with higher replication it's even worse – 8,000,000+ (200^3).
So, while this strategy will likely help the replication = 2 loss case, and sometimes help replication = 3, it is disastrous for replication = 4 or more.
Furthermore, it complicates block placement, removal, and emergency replication significantly.
To have higher data safety, I feel that working on reducing TTR and preventing normal cluster activity from causing bad placement and distribution is most important.
However, a diskbased strategy can be of use without impacting TTR. Disk failure, node failure, and rack failure are all different beasts here.
This is a really an excellent discussion and something that is at the heart of providing good data availability guarantees of HDFS.
In our production HDFS clusters, I have never seen "missing blocks" occurring at all, especially because HDFS re-replicates blocks as soon as one of the replicas goes missing. Each of our nodes has 12 TB, but because we have a 2000-node cluster, the MTTR is small, less than 1 hour or so.
However, I have seen "missing blocks" occur many many times when a large number of machines reboot simultaneously. This occurs when a power-failure event occurs or when a bad piece of software inadvertently decides to reboot many machines at the same time. Many cluster nodes refuse to start up after a power-failure event and need manual intervention to resurrect them. Out of our 2000-node cluster, it is likely that 50-odd machines refuse to start up. These 50 machines are randomly scattered among all 2000 nodes; their rack or node placement in the network topology usually has nothing to do with their refusal to start up. It is usually the case that a few blocks go missing until a human goes and restarts most of those 50 machines.
I have never seen simultaneous correlated disk failures on different nodes.
So, for a cluster that has two thousand nodes, it might make sense to create 20 nodegroups, each of size 100 machines or so. For smaller clusters, the current random-allocation policy might continue to be the best. This should keep the TTR relatively small.
I would like to hear from other system administrators on whether they frequently see "missing blocks" occurrences, and their reasoning/analysis on what could have caused those situations. It will help us better understand the best block-placement policy to adopt.
This JIRA does not yet talk about the engineering aspects of implementing a nodegroup. By design, these groups have to be persisted across cluster restarts. Rodrigo's approach says that arranging the IP addresses of nodes in some priority order gives us a way to decide node groups. What are the command-line utilities to list/specify nodegroup sizes? Do we allow the resizing of nodegroups on a running cluster? Can we say that this proposal requires the existence of an "includes" file so that the nodegroup assignment remains the same even if a few nodes are transiently out of service? Do we define the priority order of nodes so that it is circular in nature? If so, it could cause some trouble when new nodes are added to the cluster. Can we say that a new balancer has to be developed that detects and fixes block replicas based on the nodegroup policy?
Thanks Dhruba for the clear explanation.
It seems to me that this problem is not as simple as initially suggested. Could we have a selfcontained design doc?
I got permission from Alex Smith to post his script that calculates probabilities here. I'll do that in a minute, together with a wrapper I wrote that, given some cluster configurations, displays the data loss probabilities of the following approaches:
 DEFAULT: The default block placement policy used by HDFS
 RING GROUPS: The algorithm I described
 DISJOINT GROUPS: Separating the cluster in disjoint node groups as Joy proposed.
Both RING and DISJOINT depend on windows of choice (node groups) that I define along two dimensions: racks and machines_within_rack. Thus, the total size of the window is racks * machines_within_rack machines.
Below you will find some numbers. You might notice that the difference between RING and DISJOINT is not as big as one might think. Assuming the scripts are right, here is a quick explanation.
RING doesn't have as much freedom in selecting triples inside the window as DISJOINT has. In RING, the first replica is fixed and defines the group, so the space of choice is limited to the remaining two replicas, which must be in the same rack.
DISJOINT allows the primary replica to be in any rack in the window, allowing more triples.
This was part of the confusion I had understanding the 'node groups' definition and math, since it was not taking this factor into consideration.
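One way to see the asymmetry is a simplified counting exercise (my own reading of the two schemes, assuming a window of `racks` x `machines` and the two-replicas-on-one-rack constraint, and ignoring ring wraparound; these helpers are not the attached scripts):

```python
from math import comb

def disjoint_triples(racks, machines):
    # DISJOINT: any rack in the window may hold the replica pair,
    # any other rack the single replica.
    return racks * comb(machines, 2) * (racks - 1) * machines

def ring_triples(racks, machines):
    # RING: the first replica is fixed, so only the off-rack pair remains free:
    # one of the other racks, then a pair inside its machine window.
    return (racks - 1) * comb(machines, 2)

print(disjoint_triples(2, 5), ring_triples(2, 5))  # 100 vs 10
```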
Here are the numbers:
===== 6 racks of 20 machines = 120 machines =====
DEFAULT => 0.00010667
RING GROUPS (window = 2 racks, 5 machines) => 2.37756e06
DISJOINT GROUPS (window = 2 racks, 5 machines) => 1.19392e06
RING GROUPS (window = 2 racks, 10 machines) => 1.04558e05
DISJOINT GROUPS (window = 2 racks, 10 machines) => 5.3346e06
RING GROUPS (window = 2 racks, 20 machines) => 3.97347e05
DISJOINT GROUPS (window = 2 racks, 20 machines) => 2.22069e05
RING GROUPS (window = 3 racks, 5 machines) => 4.77748e06
DISJOINT GROUPS (window = 3 racks, 5 machines) => 2.38034e06
RING GROUPS (window = 3 racks, 10 machines) => 2.01693e05
DISJOINT GROUPS (window = 3 racks, 10 machines) => 1.06015e05
RING GROUPS (window = 3 racks, 20 machines) => 6.95331e05
DISJOINT GROUPS (window = 3 racks, 20 machines) => 4.38487e05
RING GROUPS (window = 6 racks, 5 machines) => 1.15237e05
DISJOINT GROUPS (window = 6 racks, 5 machines) => 5.91359e06
RING GROUPS (window = 6 racks, 10 machines) => 4.57052e05
DISJOINT GROUPS (window = 6 racks, 10 machines) => 2.61534e05
RING GROUPS (window = 6 racks, 20 machines) => 0.00010667
DISJOINT GROUPS (window = 6 racks, 20 machines) => 0.00010667
===== 50 racks of 20 machines = 1000 machines =====
DEFAULT => 0.00584495
RING GROUPS (window = 2 racks, 5 machines) => 1.99588e05
DISJOINT GROUPS (window = 2 racks, 5 machines) => 9.94932e06
RING GROUPS (window = 2 racks, 10 machines) => 8.92791e05
DISJOINT GROUPS (window = 2 racks, 10 machines) => 4.44541e05
RING GROUPS (window = 2 racks, 20 machines) => 0.000369203
DISJOINT GROUPS (window = 2 racks, 20 machines) => 0.000185042
RING GROUPS (window = 5 racks, 5 machines) => 7.92698e05
DISJOINT GROUPS (window = 5 racks, 5 machines) => 3.94854e05
RING GROUPS (window = 5 racks, 10 machines) => 0.000348759
DISJOINT GROUPS (window = 5 racks, 10 machines) => 0.000174956
RING GROUPS (window = 5 racks, 20 machines) => 0.00133532
DISJOINT GROUPS (window = 5 racks, 20 machines) => 0.000716171
RING GROUPS (window = 10 racks, 5 machines) => 0.000174504
DISJOINT GROUPS (window = 10 racks, 5 machines) => 8.85916e05
RING GROUPS (window = 10 racks, 10 machines) => 0.000742117
DISJOINT GROUPS (window = 10 racks, 10 machines) => 0.00038451
RING GROUPS (window = 10 racks, 20 machines) => 0.00262672
DISJOINT GROUPS (window = 10 racks, 20 machines) => 0.00153868
===== 100 racks of 20 machines = 2000 machines =====
DEFAULT => 0.0160829
RING GROUPS (window = 2 racks, 5 machines) => 4.04848e-05
DISJOINT GROUPS (window = 2 racks, 5 machines) => 1.98985e-05
RING GROUPS (window = 2 racks, 10 machines) => 0.000181715
DISJOINT GROUPS (window = 2 racks, 10 machines) => 8.89063e-05
RING GROUPS (window = 2 racks, 20 machines) => 0.000737402
DISJOINT GROUPS (window = 2 racks, 20 machines) => 0.000370051
RING GROUPS (window = 10 racks, 5 machines) => 0.000352741
DISJOINT GROUPS (window = 10 racks, 5 machines) => 0.000177175
RING GROUPS (window = 10 racks, 10 machines) => 0.00151145
DISJOINT GROUPS (window = 10 racks, 10 machines) => 0.000768873
RING GROUPS (window = 10 racks, 20 machines) => 0.00550483
DISJOINT GROUPS (window = 10 racks, 20 machines) => 0.00307498
RING GROUPS (window = 25 racks, 5 machines) => 0.000906714
DISJOINT GROUPS (window = 25 racks, 5 machines) => 0.000452064
BTW, the numbers assume that p (the probability of a single node failing) is 0.001.
Script written by Alex Smith to calculate data loss probabilities of different HDFS block placement policies.
Script I wrote to wrap Alex's program and give the results of different proposals in different cluster settings.
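For intuition, here is a much-simplified re-derivation of the DISJOINT numbers above (this is my own sketch, not the attached scripts; it assumes every 3-node subset within a group holds at least one block, which slightly overestimates the loss probability):

```python
from math import comb

def p_group_loss(n, p, r=3):
    """Probability that at least r of a group's n nodes fail
    simultaneously, each node failing independently with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

def p_cluster_loss(total_nodes, group_size, p):
    """Probability that at least one disjoint group suffers >= 3 failures."""
    groups = total_nodes // group_size
    return 1 - (1 - p_group_loss(group_size, p)) ** groups

# 50 racks x 20 machines = 1000 nodes, window = 2 racks x 5 machines
# => disjoint groups of 10 nodes each
print(p_cluster_loss(1000, 10, 0.001))  # ~1.2e-05, same order as 9.94932e-06 above
```

The small gap to the reported 9.94932e-06 is consistent with the overestimating assumption that every triple within a group is a replica set.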
This needs to change "p" from a constant to a function of the TTR window.
"Probability of a single node failing" alone is meaningless; it's concurrent failure that is the issue. The odds of concurrent node failure are linearly proportional to TTR. I think this model needs to assume one failure at odds = 1.0, then use the odds of concurrent failure for the next 2 failures within the time window. A 'constant' chance of failure begs the question, ".001 chance of failure per what?" The first failure happens, that is assumed. Then the next two happen given odds within a time window.
Assuming Hadoop failure replication is optimized (which it isn't; the DN dishes out block replication requests too slowly).
TTR is inversely proportional to the number of racks in a group for rack failure.
TTR is inversely proportional to the number of racks in a group for single node failure IF the combined bandwidth of the machines in the group in a rack is at least 2x the between-rack bandwidth; otherwise it is inversely proportional to the ratio of rack bandwidth to node group bandwidth.
The result is that only the "medium" sized groups above are viable, else it takes too long to get data replicated when a failure happens. Also, the TTR affects the odds of data loss on larger replication counts disproportionately.
Hi Scott,
I totally understand your concerns. BTW, thanks for the thorough analysis you gave. That was impressive!
I would say that the main problem is the complexity of the calculations. It was already hard to calculate things with this simplified model; adding more variables would be harder. Besides, I think we are more interested in permanent failures – those that cannot be recovered – since we are trying to reduce the odds of permanently losing data in an unrecoverable way.
I'm not saying we should not use TTR. I'm just saying that we didn't aim for that in our evaluation.
I guess the best thing we wanted to take from these numbers was the comparison between the different algorithms, and whether it was worth changing the block placement policy or not.
I think this is a great discussion JIRA and I'm really happy with all the great ideas people have been giving. Having said that, it's clear to me that there is no definite solution to the problem. Depending on how you approach it, different policies will be the optimal ones. For instance, take DEFAULT, RING, and DISJOINT. They are all great and valid approaches, each one with its ups and downs. I believe the main result of this JIRA is that people will start creating different probability models and algorithms depending on their use cases, and we will see several block placement policies coming out of it.
I think we are more interested in permanent failures – those that cannot be recovered
That simplifies things. We can ignore rack failure, which is predominantly an availability problem not a data loss problem.
Then, that makes the TTR issue primarily about how many nodes a group has per rack. So the short answer for keeping TTR from growing too large is to have enough group nodes in each rack.
R = replication count.
Let's say that p_0 is our baseline probability that R nodes will simultaneously fail. (R = 3, p_0 = 0.001 in the calculations above.)
Now let's find p, the adjusted probability taking TTR into account.
N = machines per rack in a group.
RB = rack bandwidth.
NB = node bandwidth.
if (NB * N) / RB >= (R-1), then p = p_0.
else, p = p_0 * ((RB * (R-1)) / (NB * N)) ^ (R-1).
Assuming p_0 = 0.001 and R = 3 this is more clearly presented as:
IF (NB * N) >= (RB * 2)
p = 0.001
ELSE
p = 0.001 * ((2 * RB) / (NB * N)) ^ 2
If Rack Bandwidth is 10x node bandwidth, then this is:
IF N >= 20, then p = 0.001
ELSE p = 0.001 * (20/N) ^ 2
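The adjustment rule above can be written out as a quick sanity check (a sketch; the function and parameter names are mine, not from the attached scripts):

```python
def adjusted_p(p0, rack_bw, node_bw, n, r=3):
    """TTR-adjusted failure probability per the rule above: if the combined
    node bandwidth of a group's nodes in a rack covers (R-1)x the rack
    uplink, p is unchanged; otherwise loss probability grows with the
    (R-1)th power of the replication slowdown."""
    if node_bw * n >= rack_bw * (r - 1):
        return p0
    slowdown = (rack_bw * (r - 1)) / (node_bw * n)
    return p0 * slowdown ** (r - 1)

# Rack bandwidth 10x node bandwidth, as in the example above:
print(adjusted_p(0.001, rack_bw=10, node_bw=1, n=5))   # (20/5)^2 * 0.001 = 0.016
print(adjusted_p(0.001, rack_bw=10, node_bw=1, n=20))  # 0.001
```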
In the 100 rack, 2000 node example you have the below as a subsection:
RING GROUPS (window = 10 racks, 5 machines) => 0.000352741
DISJOINT GROUPS (window = 10 racks, 5 machines) => 0.000177175
RING GROUPS (window = 10 racks, 10 machines) => 0.00151145
DISJOINT GROUPS (window = 10 racks, 10 machines) => 0.000768873
RING GROUPS (window = 10 racks, 20 machines) => 0.00550483
DISJOINT GROUPS (window = 10 racks, 20 machines) => 0.00307498
I'm not sure if RING GROUPS are better at TTR issues than DISJOINT GROUPS, I haven't thought through that one. But assuming its the same then adjusting for TTR gives these adjusted data loss odds for the above:
(TTR is 4x longer, so probability of loss is 16x)
RING GROUPS (window = 10 racks, 5 machines) => 0.005643856
DISJOINT GROUPS (window = 10 racks, 5 machines) => 0.0028348
(TTR is 2x longer, so probability of loss is 4x)
RING GROUPS (window = 10 racks, 10 machines) => 0.0060458
DISJOINT GROUPS (window = 10 racks, 10 machines) => 0.003075492
RING GROUPS (window = 10 racks, 20 machines) => 0.00550483
DISJOINT GROUPS (window = 10 racks, 20 machines) => 0.00307498
Interestingly, this almost exactly compensates for the shrinking of group size once TTR is limited to the intra-rack bandwidth in the group.
Note that p_0 itself increases with the cluster size. The more nodes, the higher the likelihood of co-occurrence of node failures.
My conclusion is that this is useful especially for large clusters with large numbers of nodes per rack, or larger ratios of intra-rack bandwidth to inter-rack bandwidth. For clusters on the other end of those spectrums, it's hard to beat the current placement algorithm as long as replication of missing blocks is done at maximum pace.
Of course in the real world, the Namenode does not issue block replication requests at the maximum pace. I did some tests a few days ago with three scenarios (decommission, node error, missing node at startup) and found that the bottleneck in replication is how fast the NN schedules block replication, not the network or the data nodes. It schedules block replication in batches that are too small to saturate the network or disks, and does not schedule batches aggressively enough. So one way to increase data reliability in the cluster is to work on that and therefore reduce TTR.
> We can ignore rack failure, which is predominantly an availability problem not a data loss problem.
We should NOT ignore rack failures. If this simplifies the probabilistic models it's fine, but in practice rack failures should be accounted for. If data is not available for hours as a result of this, which is typical, the clients will start complaining. Also this would be a degradation from the current policy.
I'm trying to understand this, it'll take me a while to get round the maths.
 If you keep stuff all off the same switch, replication costs are lower, maybe P(loss) is lower, but the cost of a network partition is higher, as then the probability of any copy of the data being visible is reduced. That may or may not be something to worry about. Konstantin does, clearly; I think it may be best to detect a partition event and maybe drop into safe mode if a significant percentage of the cluster just goes away. This depends on what you want: replication storm and possible cascade failures vs hard outage. Either way, the ops team get paged.
 If your cluster is partitioned into independent power sources/UPSes, then again, less independence. Failure of the power source takes a big chunk of the cluster offline. This won't look significantly different to the NN/JT, unless they are hooked into events from the UPS.
Batches of HDDs may suffer from the same flaws. In an ideal world, you'd know the history of every HDD and avoid having all copies of the data on the same batch of disks until you considered them bedded in. This would imply knowing about the internal disk state of every datanode though...
With pluggable block placement, I'm not sure where we sit with this JIRA.... so.... ping!
One simple solution that comes to mind: let's arrange all possible racks in a cluster and number them in a logical fashion 0, 1, 2, ..., n.
1. The first replica is on the local node on rack r. The other two replicas are placed on randomly selected nodes on either rack r-1 or rack r+1. In this approach, three datanodes in two consecutive racks have to fail simultaneously for a block loss to occur. This is better than the current implementation, where any three datanode failures in the entire cluster can cause one block to be lost.
2. The first replica is on the local node on rack r. Let's say that the local node is the 6th node in the local rack. Then the other two replicas of this block will also reside on the 6th node of two randomly selected remote racks. In this approach, three datanodes occupying the same position p in their racks have to fail for one block to be lost.
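The two proposals can be sketched as follows (my own illustration; function names and the list-of-lists cluster representation are hypothetical — a real implementation would plug into the BlockPlacementPolicy interface discussed in HDFS-385):

```python
import random

def ring_replicas(cluster, r, local_node):
    """Proposal 1 (sketch): replica 1 on local_node in rack r; replicas 2
    and 3 on random nodes of the adjacent racks r-1 and r+1 (wrapping)."""
    n = len(cluster)
    neighbors = cluster[(r - 1) % n] + cluster[(r + 1) % n]
    return [local_node] + random.sample(neighbors, 2)

def diagonal_replicas(cluster, r, pos):
    """Proposal 2 (sketch): replica 1 at position pos in rack r; replicas 2
    and 3 at the same position pos in two distinct random remote racks."""
    remote = [i for i in range(len(cluster)) if i != r]
    return [cluster[r][pos]] + [cluster[i][pos] for i in random.sample(remote, 2)]

# 5 racks of 4 nodes, node names like "r2n3" (rack 2, node 3)
cluster = [[f"r{i}n{j}" for j in range(4)] for i in range(5)]
print(ring_replicas(cluster, r=2, local_node="r2n3"))
print(diagonal_replicas(cluster, r=2, pos=3))
```

Note that `ring_replicas` may place both remote replicas in the same neighboring rack, which matches the wording of proposal 1 ("either rack r-1 or r+1"); excluding that case would be a small variation.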