Region locality information is needed by the balancer to generate region plans. Computing HDFSBlockDistribution is expensive on larger clusters and adds load to the NameNode. This also needs to be recomputed on a master restart. The proposal is to get the HDFSBlockDistribution from the RegionServers instead of computing it in Master. RS already has this information and we could just reuse it by querying it. RS already passes dataLocality info via RegionLoad today.
Proposed Implementation: This is a high-level overview.
- A RegionServer API has to be added which will return HDFSBlockDistribution for all the regions it hosts. RS already has this info. Since ClusterStatus has already become bulky and we don’t need updated locality so fast, it’s better to have another API rather than add this to RegionLoad and pass it along with RSReport.
- Master will have a Chore to query all RegionServers and will cache the HDFSBlockDistribution for those regions. This is easy and quick. Admins can tune the frequency based on size of the cluster. On a ~90 nodes cluster with 500k regions and a prototype implementation and no load, it took about 5 seconds to get all HDFSBlockDistribution from RS.
- The cache will be an extension of RegionLocationFinder (subclass), if needed to keep the implementation simple. Probably will get clear with implementation.
- Balancer will use the new cache to get all HDFSBlockDistribution. If there is a new region and Chore didn’t get the block distribution from RS during its previous run, then it will be computed by RegionLocationFinder the same way it has been done now. If the Chore runs more frequently like every hour, then this recomputation will be drastically reduced.