Thanks Binglin Chang for working on this effort.
I went through it quickly; a few comments are below (not a complete review):
The original Hadoop topology supports a 3-layer topology looks like following:
I think it is better to say that Hadoop previously supported only a 2-layer topology: rack and host. Mentioning the datacenter layer will confuse users, as it has never worked, even now. For the same reason, we should mention that we now support a 3-layer topology/locality: rack, nodegroup and host.
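To make the 2-layer vs. 3-layer difference concrete for readers of the doc, here is a minimal sketch, not Hadoop's actual implementation. It assumes locations are written as slash-separated paths (for example "/r1/h1" or "/r1/ng1/h1") and that distance is the number of hops up to the closest common ancestor plus the hops back down. The names split_location and distance are hypothetical.

```python
def split_location(path):
    """Split a topology path such as '/r1/ng1/h1' into its layers."""
    return [part for part in path.split("/") if part]


def distance(a, b):
    """Hops from a up to the closest common ancestor, plus hops down to b.

    Illustrative only: the path format and hop counting are assumptions.
    """
    layers_a, layers_b = split_location(a), split_location(b)
    common = 0
    for x, y in zip(layers_a, layers_b):
        if x != y:
            break
        common += 1
    return (len(layers_a) - common) + (len(layers_b) - common)


# 2-layer topology: rack and host
same_rack_2 = distance("/r1/h1", "/r1/h2")
other_rack_2 = distance("/r1/h1", "/r2/h3")

# 3-layer topology: rack, nodegroup and host
same_nodegroup = distance("/r1/ng1/h1", "/r1/ng1/h2")
same_rack_3 = distance("/r1/ng1/h1", "/r1/ng2/h3")
other_rack_3 = distance("/r1/ng1/h1", "/r2/ng3/h4")
```

With the extra layer, two hosts in the same nodegroup are closer to each other than two hosts that only share a rack, which is the locality distinction the doc should spell out.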
This network topology is designed and work well for Hadoop cluster running on physical server farms. However, for Hadoop running on virtualized platform, we have additional "hypervisor" layer, and its characteristics include:...
I think the use case for the NodeGroup layer is broader than virtualization: it suits any sub-dependency among nodes between the rack and host layers. So it could be better to say something like: "This network topology is designed to work well on Hadoop clusters that only have rack (switch or power) failure dependencies among nodes. However, for other cases, such as Hadoop nodes running on a virtualized platform, we have an additional "hypervisor" layer, and its characteristics include ..."
Due to above characteristics in performance and reliability, this layer is not transparent for Hadoop...
Reliability is more important here, so it would be better to say: "Due to the above characteristics in reliability and performance, this layer shouldn't be transparent to Hadoop..."
1st replica is on the local node or local node group of the writer
To be more precise, we could say something like: "The 1st replica is placed on the node nearest to the writer in the topology. In most cases it should be on the same node as the writer, but it could be on another node in the same nodegroup or rack if the writer's node is not qualified to hold the replica (e.g. it has no local datanode or its disk is full)."
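The suggested wording can be sketched as follows. This is only an illustration of "nearest qualified node", not the real block placement policy: choose_first_replica, the path format, and the qualified set are all hypothetical, and ties are broken by candidate order.

```python
def distance(a, b):
    """Hops up to the closest common ancestor plus hops down (assumed metric)."""
    layers_a = [p for p in a.split("/") if p]
    layers_b = [p for p in b.split("/") if p]
    common = 0
    for x, y in zip(layers_a, layers_b):
        if x != y:
            break
        common += 1
    return (len(layers_a) - common) + (len(layers_b) - common)


def choose_first_replica(writer, candidates, qualified):
    """Return the qualified candidate nearest to the writer, or None.

    'qualified' is a hypothetical set of nodes that have a datanode
    and enough free disk space to hold the replica.
    """
    usable = [node for node in candidates if node in qualified]
    if not usable:
        return None
    # min() keeps the first candidate among equally distant ones
    return min(usable, key=lambda node: distance(writer, node))


nodes = ["/r1/ng1/h1", "/r1/ng1/h2", "/r1/ng2/h3", "/r2/ng3/h4"]
writer = "/r1/ng1/h1"

local = choose_first_replica(writer, nodes, set(nodes))
in_nodegroup = choose_first_replica(writer, nodes, set(nodes) - {"/r1/ng1/h1"})
in_rack = choose_first_replica(writer, nodes, {"/r1/ng2/h3", "/r2/ng3/h4"})
```

Here the replica lands on the writer's own node when it is qualified, falls back to another node in the same nodegroup when it is not, and then to another node in the same rack.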
In the diagram, it is better to omit the "datacenter" layer, per the comments above, and to rename the red "S1" layer to "NG1" to reflect the NodeGroup layer.