Thanks Lei (Eddy) Xu! These are very good points. Here is the updated design doc, which answers some of your questions in detail. Please find specific replies below.
How about calling it Availability Domain?
Availability might be too general in this context. The service can become unavailable due to an unplanned event such as a ToR switch outage, or due to planned maintenance such as a software upgrade; both impact availability. If we define "Availability Domain" as "if all machines in that domain are unavailable, the service can still function", then the machines in a single rack could also be considered one availability domain.
Is this upgrade domain on each DN a soft state or a hard state?
It is a hard state, just like the network location of the node. While admins will likely keep upgrade domains unchanged during normal operations, the design allows admins to move machines between upgrade domains, as long as the machines are properly decommissioned first; when machines rejoin under different upgrade domains, any replicas that then violate the placement policy will be removed. The updated design doc provides more details on this.
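As one illustration of treating the upgrade domain as a hard, admin-assigned state, it could be persisted alongside the host list the NameNode already reads. The JSON shape below is a hypothetical sketch; the field names and values are my assumptions, not part of the design doc:

```json
[
  { "hostName": "dn-rack1-01", "upgradeDomain": "ud_1" },
  { "hostName": "dn-rack1-02", "upgradeDomain": "ud_2" },
  { "hostName": "dn-rack2-01", "upgradeDomain": "ud_1" }
]
```

Because the mapping lives in configuration rather than in DataNode runtime state, it survives restarts, which is what makes it a hard state.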
What do you anticipate as a good strategy to choose upgrade domains (UDs)?
The updated design doc has more on this. The number of upgrade domains affects data-loss probability, replica recovery time, and rolling-upgrade parallelism.
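To make the trade-off concrete, here is an illustrative back-of-the-envelope sketch (the node counts are my assumptions, not from the design doc): with N machines split evenly across U upgrade domains, a rolling upgrade can take down one whole domain at a time.

```python
# Sketch of the upgrade-domain-count trade-off. More domains mean smaller
# upgrade batches (less parallelism), but fewer replicas offline at once,
# which shortens recovery if a batch fails.
def nodes_per_domain(num_nodes: int, num_domains: int) -> int:
    """Machines taken down together during a rolling upgrade,
    assuming nodes are spread evenly across upgrade domains."""
    return num_nodes // num_domains

for u in (3, 10, 40):
    print(f"U={u}: upgrade {nodes_per_domain(960, u)} nodes per batch")
```

With 3 domains (equal to the replication factor), each batch is large but every block has exactly one replica per domain; with 40 domains, batches are small and a single domain outage touches far fewer replicas.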
Regarding the performance impact
# of racks is on the order of 100, # of upgrade domains is in the ballpark of 40, and # of addBlock operations is around 1000 ops/sec at peak.
In design v2.pdf, would you mind rephrasing the description of the "Replica delete operation"?
The updated design doc adds more description of this process.
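The core idea can be sketched as follows. This is an illustrative toy, not the actual HDFS implementation, and the helper name is hypothetical: when a block is over-replicated, prefer deleting a replica whose upgrade domain is already covered by another replica, so the survivors still span distinct domains.

```python
# Toy sketch of a delete-replica choice under an upgrade-domain policy.
def choose_replica_to_delete(replicas):
    """replicas: list of (datanode, upgrade_domain) pairs for one block.
    Returns the datanode whose replica is safest to remove."""
    domain_counts = {}
    for _, dom in replicas:
        domain_counts[dom] = domain_counts.get(dom, 0) + 1
    # Prefer a replica in a duplicated domain: deleting it keeps every
    # upgrade domain that currently holds a replica still covered.
    for node, dom in replicas:
        if domain_counts[dom] > 1:
            return node
    # All domains distinct; any replica can go without losing coverage.
    return replicas[0][0]

# dn2 and dn3 share ud2, so dn2's replica is removed first.
print(choose_replica_to_delete(
    [("dn1", "ud1"), ("dn2", "ud2"), ("dn3", "ud2"), ("dn4", "ud3")]))
```

A real policy would also weigh rack placement and free space when breaking ties; this sketch isolates only the upgrade-domain constraint.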
The last one may not be relevant: would this design work well with erasure coding?
A similar question was asked in HDFS-7613, about how we can reuse different block placement policies. Like you said, we can address this issue separately.