> Should we consider a mobile network ...
No, that is not the concern. The concern is that in a large enough installation (1000s of nodes) there are going to be several instances a week of nodes going down and getting repaired. When the nodes return they may not be in the same racks or on the same switches. This probably does not happen by design, but mistakes are always possible.
> I would simply update network topology when a datanode registers or exits.
Certainly, that seems like the right approach. The question is how the topology is updated.
Is it by updating a central configuration file and having the namenode read it? This implies potentially updating configuration every time anything changes in a datacenter.
Is it by running timing experiments every time a datanode registers? This can be biased by transient network conditions.
Neither of the above seems like a productive use of admin or namenode time.
Instead of a network topology interface, why not have a network location interface on the Datanode?
public interface NetworkLocation
This returns an array of hubs that characterize a node's location on the network. This is probably an array of strings of the form <key>=<value>. So a possible output could be:
rack=r1, switch=s2, datacenter=d3, ...
as many levels as are desirable. Also, network distances between nodes aren't the only things that are interesting. I think it's useful to distinguish between a rack and a switch because a rack is commonly a physical power domain.
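A rough sketch of what such an interface could look like follows. The method name getHubs(), the StaticNetworkLocation class, and the idea of parsing the location from a single comma-separated string are all assumptions made for illustration; none of them exist in the current code.

```java
// Hypothetical sketch only: getHubs() and StaticNetworkLocation are
// illustrative names, not existing Hadoop APIs.
interface NetworkLocation {
    /** Hubs ordered from the innermost level outward, each of the form key=value. */
    String[] getHubs();
}

final class StaticNetworkLocation implements NetworkLocation {
    private final String[] hubs;

    /** Parses a string such as "rack=r1, switch=s2, datacenter=d3". */
    StaticNetworkLocation(String spec) {
        String[] parts = spec.split(",");
        hubs = new String[parts.length];
        for (int i = 0; i < parts.length; i++) {
            String hub = parts[i].trim();
            int eq = hub.indexOf('=');
            // Require a non-empty key and a non-empty value.
            if (eq <= 0 || eq == hub.length() - 1) {
                throw new IllegalArgumentException("expected key=value but got: " + hub);
            }
            hubs[i] = hub;
        }
    }

    @Override
    public String[] getHubs() {
        // Return a copy so callers cannot modify the stored location.
        return hubs.clone();
    }
}
```

A Datanode could build one of these from whatever local source describes its position and send the resulting array along when it registers.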
Given this output from each Datanode we can then have a concrete implementation of NetworkTopology that simply tracks the membership of each hub. Finding the distance between two nodes is done by comparing their arrays of hubs and stopping where they differ.
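To make the membership tracking and the distance comparison concrete, here is a minimal sketch. It is not the existing NetworkTopology code; the class name HubTopology and its methods are hypothetical. It assumes every node reports the same number of levels, that hubs are ordered innermost first (rack, then switch, then datacenter), and that hub strings are unique across the installation. The comparison walks from the outermost hub inward and stops at the first level where the two nodes differ.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: HubTopology is an illustrative name, not an existing class.
final class HubTopology {
    private final Map<String, String[]> hubsByNode = new HashMap<>();
    private final Map<String, Set<String>> nodesByHub = new HashMap<>();

    /** Records a node's hubs; re-registering replaces any earlier location. */
    void register(String node, String[] hubs) {
        unregister(node); // a repaired node may come back in a different rack
        hubsByNode.put(node, hubs.clone());
        for (String hub : hubs) {
            nodesByHub.computeIfAbsent(hub, k -> new HashSet<>()).add(node);
        }
    }

    /** Removes a node from every hub it belonged to. */
    void unregister(String node) {
        String[] old = hubsByNode.remove(node);
        if (old == null) {
            return;
        }
        for (String hub : old) {
            Set<String> members = nodesByHub.get(hub);
            if (members != null) {
                members.remove(node);
                if (members.isEmpty()) {
                    nodesByHub.remove(hub);
                }
            }
        }
    }

    /** Returns a read-only view of the nodes currently attached to a hub. */
    Set<String> members(String hub) {
        Set<String> m = nodesByHub.get(hub);
        return m == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(m);
    }

    /**
     * Returns 0 when both nodes share every hub; otherwise 1 plus the index
     * of the outermost level at which their hubs differ.
     */
    int distance(String nodeA, String nodeB) {
        String[] a = hubsByNode.get(nodeA);
        String[] b = hubsByNode.get(nodeB);
        if (a == null || b == null) {
            throw new IllegalArgumentException("unknown node");
        }
        if (a.length != b.length) {
            throw new IllegalArgumentException("nodes report different numbers of levels");
        }
        // Walk from the outermost hub inward and stop at the first difference.
        for (int level = a.length - 1; level >= 0; level--) {
            if (!a[level].equals(b[level])) {
                return level + 1;
            }
        }
        return 0;
    }
}
```

With three levels, two nodes in the same rack are at distance 0, nodes in different racks on the same switch are at distance 1, and nodes on different switches in the same datacenter are at distance 2. Because register() first drops the old membership, a node that returns from repair in a different rack is moved simply by registering again.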
> Allocating all blocks of a file to the same 3 racks limits the aggregate read bandwidth.
It limits the aggregate read bandwidth to that particular file only. Every file will have its own set of 3 racks, so I don't think it affects overall filesystem bandwidth. On the other hand, it potentially gives a client that is reading an entire file better locality.
> Users may specify its replica placement policy
I agree with Doug. This seems like a subsequent feature.