You know how things work when there are deadlines to meet
Totally understand, no problem
1. How would you maintain the mapping of files to groups?
We don't maintain the mapping in HDFS; instead we rely on the regionserver group information. In our use case, this tool is used along with the regionserver group feature: the admin can get the RS group information through an hbase shell command and pass the server list to the balancer. To make this easier, we actually wrote a simple script that drives the whole process, so the admin only needs to enter an RS group name to balance its data. For more details, please refer to the answer to question #4.
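As a rough sketch of that admin workflow (the hostnames and the servers file below are made up, and the `-include` option is the balancer form proposed in HDFS-6010; in practice the host list would come from the hbase shell's RS group listing):

```shell
# Hypothetical example: the group's servers file would be produced from the
# hbase shell RS group output by the admin script mentioned above.
GROUP_SERVERS_FILE="group_a_servers.txt"

# Stand-in content for the demo (one hostname per line).
printf 'dn1.example.com\ndn2.example.com\ndn3.example.com\n' > "$GROUP_SERVERS_FILE"

# Join the hostnames into the comma-separated list the balancer takes.
SERVER_LIST=$(paste -sd, "$GROUP_SERVERS_FILE")
echo "$SERVER_LIST"

# The actual balancer run (commented out; requires a live cluster):
# hdfs balancer -include "$SERVER_LIST"
```

The point of the script is just to spare the admin from assembling this list by hand for each group.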
wondering whether it makes sense to have the tool take paths for balancing as opposed to servers
For our hbase use case, servers are fine. But I think it might be better to keep the tool general: there may be other scenarios that require balancing data among a subset of the datanodes rather than the full set, although I cannot name one right now.
2. Are these mappings set up by some admin?
Yes, as described in the comments above.
3. Would you expand a group when it is nearing capacity?
Yes. We can change the membership of an RS group, e.g. move one RS from groupA to groupB, and then use the HDFS-6012 tool to move blocks so that "group-block-locality" is preserved. More on this topic in the answer to question #5.
4. How does someone like HBase use this? Is HBase going to have visibility into the mappings as well (to take care of HBASE-6721 and favored-nodes for writes)?
Yes. Through HBASE-6721 (we have actually made quite a few improvements to it to make it simpler and more suitable for our production environment, but that's another topic and won't be discussed here) we can group regionservers to provide multi-tenant service: each application uses one RS group (regions of all of that application's tables are served only by regionservers in its own group), and its data is written to the mapped datanodes through the favored-node feature. To be more specific, it's an "app-regionserverGroup-datanodeGroup" mapping: all hfiles of an application's tables are located only on the datanodes of its RS group.
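As a toy illustration of that "app-regionserverGroup-datanodeGroup" mapping (the application and group names are made up; in the real deployment this information lives in the RS group configuration, not in a script):

```shell
# Hypothetical lookup: which RS/DN group serves a given application.
lookup_group() {
  case "$1" in
    app_payments) echo "group_a" ;;
    app_logs)     echo "group_b" ;;
    *)            echo "default" ;;
  esac
}

# All regions and hfiles of app_payments stay on group_a's servers.
lookup_group app_payments   # -> group_a
```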
5. Would you need a higher level balancer for keeping the whole cluster balanced (do migrations of blocks associated with certain paths from one group to another)? Otherwise, there would be skews in the block distribution.
You've really got the point here. The biggest downside of this solution for IO isolation is that it causes data imbalance from the perspective of the whole HDFS cluster. In our use case, we recommend that the admin not run the balancer over all datanodes. Instead, as mentioned in the answer to question #3, if we find one group with high disk usage while another group is relatively "empty", the admin can adjust the group membership to move an RS/DN server around.
The HDFS-6010 tool plus the HDFS-6012 tool make this work.
6. When there is a failure of a datanode in a group, how would you choose which datanodes to replicate the blocks to? The choice would be somewhat important given that some target datanodes might be busy serving requests.
Currently we don't control re-replication after a datanode failure; we just use the default HDFS policy. So the only impact a datanode failure has on isolation is that blocks might be re-replicated outside the group, which is why we need the HDFS-6012 tool to periodically check for "cross-group" blocks and move them back.
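A rough sketch of what that periodic check could look like (the hostnames, port, and the sample block line are illustrative; a real run would parse the output of `hdfs fsck <path> -files -blocks -locations` for the group's paths):

```shell
# Hosts belonging to the group (example values, as an alternation pattern).
GROUP_HOSTS="dn1.example.com|dn2.example.com"

# Stand-in for one block line from:
#   hdfs fsck /hbase/data/... -files -blocks -locations
FSCK_OUT='blk_1 len=134217728 repl=3 [dn1.example.com:50010, dn2.example.com:50010, dn9.example.com:50010]'

# Flag replica locations that fall outside the group; these are the
# "cross-group" replicas the HDFS-6012 tool would move back.
CROSS=$(echo "$FSCK_OUT" | grep -oE '[a-z0-9.]+:50010' | grep -Ev "^($GROUP_HOSTS):")
echo "$CROSS"
```

Here `dn9.example.com:50010` would be reported as a cross-group replica.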
Devaraj Das, I hope the comments above answer your questions; feel free to let me know if you have any further comments.