Bryan Duxbury added a comment - 08/Dec/07 06:37 AM
This seems like it should be an important issue, since it should significantly improve the performance of the cluster (faster, less network traffic). Elevating to Major.
From what I understand about writing files to data nodes, I do not think this is a major problem. If a region's files are not located on the host serving it, then on the first compaction they would be written locally and stay local from then on, unless the current server's hard drive is full and the data has to be stored on a different server. Overall, the region servers would fix this problem by themselves over time.
So the only time this would be an issue is on a restart or after a region server failure. Downgrading priority because we should leverage Hadoop's rack awareness where possible, and there is a lot of work left to do (in Hadoop) before we can
Hi hbasers,
I'd like to work on this issue as my GSOC project "Exploit locality when assigning regions in HBase".

After talking with Stack by email, I have some initial thoughts on this issue. I'd like to share them with you, and your comments are welcome.

Before designing a suitable mechanism for using a region's locality, we need to know how blocks are allocated in an HBase cluster and how the data blocks of a given region are distributed over its lifetime, so that we can find out how region locality affects performance. It is difficult to capture all of this information in a real cluster. An alternative way to study the locality phenomenon may be to simulate, on a single machine, the data-block placement procedure in HDFS (local node, local rack, and remote rack) and the region-allocation mechanism of an HBase cluster. An approximate but detailed report from the simulation could then be used for analysis and development.

Although I don't yet have any detailed information about the locality phenomenon, I will try to give an initial proposal first. The initial proposal is to schedule each region to the datanode (regionserver) that contains the most data blocks of that region. The most challenging part is making the data-block layout of a region known to the master (we can query the namenode in HDFS to get this information). An initial method is to record this layout information for each region in the .META. table.

> Samuel Guo added a comment - 26/Mar/09 06:48 AM
> Hi hbasers,
> I'd like to work on this issue as my GSOC project "Exploit locality when assigning regions in HBase".
>
> After talking with Stack by email, I have some initial thoughts on this issue. I'd like to share them with you, and your comments are welcome.
>
> Before designing a suitable mechanism for using a region's locality, we need to know how blocks are allocated in an HBase cluster and how the data blocks of a given region are distributed over its lifetime, so that we can find out how region locality affects performance. It is difficult to capture all of this information in a real cluster. An alternative way to study the locality phenomenon may be to simulate, on a single machine, the data-block placement procedure in HDFS (local node, local rack, and remote rack) and the region-allocation mechanism of an HBase cluster. An approximate but detailed report from the simulation could then be used for analysis and development.

Although the JobTracker in Hadoop attempts to assign tasks to machines that are hosting the data, currently I think that direct disk access for local blocks would be the biggest payoff. It is unclear if there is any advantage for locality (other than limiting network access) if direct disk access is not

Solid performance data evaluating the cost of: would be highly useful.

If there is little difference between 1, 2, 3 (access to a block through a datanode) then I would expect that direct disk access would be much faster than access through a datanode, but there is no

As you point out, blocks migrate over time (especially if you are using the HDFS balancer), and that would

Suppose there was one 'hot' datanode that hosted blocks from many regions. Using locality might end up in

There is a lot of performance evaluation that needs to be done before we actually take the step of using

Before we try locality-based assignment, we need to have this analysis to see if the idea is worth pursuing.
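The proposal quoted above (send each region to the server that holds most of its blocks) and the 'hot' datanode concern can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `layout` dictionary, the host names, and the functions `most_local_server` and `assign_by_locality` are hypothetical and are not part of HBase or HDFS. In a real cluster the layout would have to come from the namenode, as the proposal notes.

```python
from collections import Counter

def most_local_server(block_hosts):
    """Return (host, locality) for one region.

    block_hosts is a list with one entry per block; each entry lists the
    hosts holding a replica of that block. locality is the fraction of the
    region's blocks that have a replica on the chosen host.
    """
    if not block_hosts:
        raise ValueError("region has no blocks")
    # Count, per host, how many of the region's blocks it holds a replica of.
    counts = Counter(host for replicas in block_hosts for host in set(replicas))
    # Highest count wins; ties are broken by host name so the result is stable.
    host, n = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return host, n / len(block_hosts)

def assign_by_locality(layout):
    """Naive policy: every region goes to its most local server."""
    return {region: most_local_server(blocks)[0] for region, blocks in layout.items()}

# Hypothetical block layout (three replicas per block, hosts dn1..dn4).
layout = {
    "region-a": [["dn1", "dn2", "dn3"], ["dn1", "dn2", "dn4"], ["dn1", "dn3", "dn4"]],
    "region-b": [["dn1", "dn2", "dn4"], ["dn1", "dn3", "dn4"]],
    "region-c": [["dn1", "dn2", "dn3"], ["dn1", "dn3", "dn4"]],
}

assignment = assign_by_locality(layout)
print(assignment)
print(Counter(assignment.values()))  # regions per server
```

In this made-up layout dn1 holds a replica of every block, so the naive policy sends all three regions to dn1. That is the hot-spot scenario raised above, and it suggests that locality alone is not a sufficient assignment criterion.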
Reading local blocks directly is covered by HADOOP-4801. In summary, the payoff from short-circuiting the datanode is small, or at least has yet to be demonstrated, and it seems doubtful that a second route to the data will be opened, because of security concerns, etc. That's my take on the issue (it could change, of course).
I think that if we only made savings in network traffic, that'd be reason enough to implement locality algorithms. JK makes an interesting point above that we could manufacture hot datanodes if we blindly serve regions from a datanode that hosts all the data, but this can happen now since we operate blindly, and it's only smart use of the locality info that will help damp hot spots. Samuel, if still interested, have you petitioned to become a GSOC student using this issue as your project? (Add in some of JK's notes on the need to research what happens in a running cluster, so we know best what to implement.)

Thanks for your comments, Jim.
> Solid performance data evaluating the cost of:
> There is a lot of performance evaluation that needs to be done before we actually take the step of using

Yes, I agree with you. We need to do a detailed analysis of the main behaviors of HDFS and HBase before we try locality-based assignment. That analysis will be the main part of my GSOC project.

> Suppose there was one 'hot' datanode that hosted blocks from many regions. Using locality might end up in

Yes, locality should be used carefully so as not to overload the region server or the datanode. An ideal region assignment would place each region close to its data to reduce network traffic, while balancing load across region servers and datanodes and avoiding disk contention on the same datanode. As you suggested, we need to understand the following things clearly before doing it. I am not yet sure how to analyze them, but I think I can take them one by one to make things clear.

Thanks, stack.
> Samuel, if still interested, have you made petition to become a GSOC student using this issue as your project? (Add in some of JKs notes on need

Yes. I will add Jim's notes to my proposal.
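One possible reading of the "ideal region assignment" described earlier in this thread (close to the data, but balanced across servers) is a greedy pass with a per-server cap. The sketch below is only an illustration under assumptions: `assign_with_cap`, the `max_regions` limit, the host names, and the `layout` data are all hypothetical, and a region count is used as a crude stand-in for real load. It is not the HBase master's assignment logic.

```python
from collections import Counter

def assign_with_cap(layout, servers, max_regions):
    """Greedy locality-aware assignment with a per-server region cap.

    layout maps region -> list of blocks, each block being the list of hosts
    that hold a replica. Each region goes to the server with the most local
    blocks among the servers that still have spare capacity.
    """
    load = Counter()   # regions assigned so far, per server
    assignment = {}
    for region in sorted(layout):
        # Number of this region's blocks with a replica on each host.
        local = Counter(h for replicas in layout[region] for h in set(replicas))
        candidates = [s for s in servers if load[s] < max_regions]
        if not candidates:
            raise ValueError("not enough capacity for all regions")
        # Prefer more local blocks, then lower load, then name (for stability).
        best = min(candidates, key=lambda s: (-local[s], load[s], s))
        assignment[region] = best
        load[best] += 1
    return assignment

# Hypothetical layout in which dn1 holds a replica of every block.
layout = {
    "region-a": [["dn1", "dn2", "dn3"], ["dn1", "dn2", "dn4"], ["dn1", "dn3", "dn4"]],
    "region-b": [["dn1", "dn2", "dn4"], ["dn1", "dn3", "dn4"]],
    "region-c": [["dn1", "dn2", "dn3"], ["dn1", "dn3", "dn4"]],
}
servers = ["dn1", "dn2", "dn3", "dn4"]

assignment = assign_with_cap(layout, servers, max_regions=1)
print(assignment)
```

With a cap of one region per server, region-a stays on dn1, while region-b and region-c fall back to dn4 and dn3, which each hold all of that region's blocks in this example. Choosing the cap, and deciding whether load should be measured by region count, request rate, or disk activity, is the kind of question the analysis discussed in this thread would need to answer.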