At a high level this has a couple of parts to it: Creation of table flow (assuming pre-split table) --------------------------------------------------------------------- 0. HBase layer has policies based on which it decides where to place the region files. AssignmentDomain is defined when a new table is created (today it is all the nodes in the cluster). 1. The HMaster chooses the primary RS on a round-robin basis, and the secondary/tertiary RSs are chosen on different racks (best effort, and best effort to place secondary/tertiary on the same rack). 2. The meta table is updated with the information about the location mapping for the added regions. 3. The RegionServers are then asked to open the regions and they get the favored nodes information as well. The mapping information from regions to favorednodes is cached in the regionservers. 4. This information is then passed to the filesystem (HDFS-) when the regionservers create new files on the filesystem. For now, a create API has been added in HDFS to accept a favorednodes list as an additional argument. Failure recovery --------------------------------------------------------------------- When the primary RS dies, ideally, we should assign the regions on that RS to their respective secondaries (maybe whichever has less load or fewer primaries among the remaining two). At some point the maintenance tool should run and set the mapping in meta right (three RS locations, etc.) Maintenance of the metadata & region locations --------------------------------------------------------------------- Over a period of time, nodes may fail, and/or hdfs-balancer may run that might potentially have a bad impact on the locality set up in above steps. Periodically, a tool would be run that would inspect the meta table, and see if the mapping is still optimal. The tool (as per the code in facebook's branch) takes a couple of options it can optimize for - maximum-locality, minimum-region-reassign, munkres algorithm for assigning secondary/tertiary RS for regions. There is a chore that periodically checks for updates in meta (based on timestamps) for the region locations and updates assignment-plans. In 0.89-fb and in prior versions of HBase, the hbase balancer is run upon regionserver reporting heartbeats, and the balancer basically ensures that the assignment-plans that have been precomputed are met (and regions might get unassigned from their current regionservers, etc.). I think it makes sense to have the above tool be part of the locality-aware loadbalancer itself since the loadbalancer today runs asynchronously and it could do a lot more work. I'll look at this aspect some more. TODO: Handle non pre-split tables