Issue Details (XML | Word | Printable)

Key: HBASE-57
Type: Improvement Improvement
Status: Open Open
Priority: Major Major
Assignee: Unassigned
Reporter: stack
Votes: 1
Watchers: 6
Operations

If you were logged in you would be able to see more operations.
Hadoop HBase

[hbase] Master should allocate regions to regionservers based upon data locality and rack awareness

Created: 01/Nov/07 05:18 PM   Updated: 24/Apr/09 10:50 PM
Return to search
Component/s: master
Affects Version/s: 0.2.0
Fix Version/s: None

Time Tracking:
Issue & Sub-Tasks
Issue Only
Not Specified

Issue Links:
Dependants
 

Sub-Tasks  All   Open   

 Description  « Hide
Currently, regions are assigned regionservers based off a basic loading attribute. A factor to include in the assignment calcuation is the location of the region in hdfs; i.e. servers hosting region replicas. If the cluster is such that regionservers are being run on the same nodes as those running hdfs, then ideally the regionserver for a particular region should be running on the same server as hosts a region replica.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Bryan Duxbury added a comment - 08/Dec/07 06:37 AM
This seems like it should be an important issue, since it should significantly improve the performance of the cluster (faster, less network traffic). Elevating to Major.

stack added a comment - 08/Dec/07 06:59 PM
This might actually already be happening. Any time I've seen a regionserver reading from hdfs, it seems to be getting its data from the local datanode. Verify.

Billy Pearson added a comment - 29/Dec/07 02:08 AM
From what I understand about writing files to data nodes I do not thank is is a major problem. If the regions are not located on the host serving server on the first compaction they would be stored local first and stay local from there on out unless the current servers hard drive is full and it has to store it on a different server. Over all the region servers would self fix this problem by default over time.

So only time this would be an issue is on a restart or after a failed region server.


Jim Kellerman added a comment - 15/Jan/08 07:39 PM
Downgrading priority because we should leverage Hadoop's rack awareness where possible, and there is a lot of work left to do (in Hadoop) before we can

Samuel Guo added a comment - 26/Mar/09 01:48 PM
Hi hbasers,
I'd like to work on this issue as my GSOC project "Exploit locality when assigning regions in HBase".

After talking with Stack in emails, I have got some initial thoughts on this issue. I'd like to share them with you and welcome for your comments.

Before designing a suitable mechanism to using the region's locality, we need to know how blocks are allocated in a hbase cluster and the data-blocks distribution of a specified region over its lifetime in hbase. so that we can find out how the region locality effect the performance. It is difficult to capture all these information in a real cluster. An alternative way to study the locality phenomeon may be simulating the data-block placement procedure in HDFS(local node, local rack, and remote rack) and the regions-allocation mechanism of a hbase cluster in a single machine. And a approximate detail report from simulation can be used for analysis and development.

Although I haven't got any detail information about the locality phenomeon, I try to give an initial proposal first. The initial proposal is to schedule the regions to the datanodes(regionservers) that contains most data-blocks of the specified region. The most challenge thing is to know the data-blocks layout(we can query namenode in HDFS to get these information) of a region in master. And an initial method is to record these layout information of regions in .META. table.
Some background threads may be run on the master scanning the .META. table to pick up the candidate nodes for region-allocation(these nodes may be sorted by the number of blocks they contain). The detail allocation mechanism will be discussed below.
(1) A blank region created when the table is first created. As we haven't got any data in it, we can allocate it according to the current loads of the cluster. It is an easy way. And after the region grows up and were flushed back to HDFS, we get the blocks' locations information and records them to .META. table for next-allocation.
(2) A region is created by splitting its parent region. We can use parent-region's blocks' location information to make an allocation decision. And after we finish the splitting procedure, we can simply copy the parent-region's blocks' location information to each sub-region's .META. table information.
(3) A region is re-allocated after the regionserver crash. The logfiles' blocks information will be considered into allocation so that we may accelerate the recovery of a failed-region.


Jim Kellerman added a comment - 27/Mar/09 08:38 PM
> Samuel Guo added a comment - 26/Mar/09 06:48 AM
> Hi hbasers,
> I'd like to work on this issue as my GSOC project "Exploit locality when assigning regions in HBase".
>
> After talking with Stack in emails, I have got some initial thoughts on this issue. I'd like to share them with you and
> welcome for your comments.
>
> Before designing a suitable mechanism to using the region's locality, we need to know how blocks are allocated in
> a hbase cluster and the data-blocks distribution of a specified region over its lifetime in hbase. so that we can find
> out how the region locality effect the performance. It is difficult to capture all these information in a real cluster. An
> alternative way to study the locality phenomeon may be simulating the data-block placement procedure in
> HDFS(local node, local rack, and remote rack) and the regions-allocation mechanism of a hbase cluster in a single
> machine. And a approximate detail report from simulation can be used for analysis and development.

Although the JobTracker in Hadoop attempts to assign tasks to machines that are hosting the data, currently
HDFS clients still must go through the data node to access the blocks. I believe there is a Jira open for Hadoop
to go directly to local disk to get the blocks (removing communication with the datanode) if the blocks are on
the local machine.

I think that direct disk access for local blocks would be the biggest payoff.

It is unclear if there is any advantage for locality (other than limiting network access) if direct disk access is not
available. Benchmarking needs to be done to determine the relative latency of accessing a block from another
server is much faster than accessing a block (via the datanode) if the block resides on the local machine. It may
not provide much advantage, unless the block is being served from a server on the other side of a switch or
router (and of course the speed of the connection).

Solid performance data evaluating the cost of:
1) network access to a block in a different rack
2) network access to a block in the same rack but on a different server
3) network access to a block on the same server
4) direct disk access to a block on the same server

would be highly useful. If there is little difference between 1, 2, 3 (access to a block through a datanode) then
locality may not be useful. On the other hand, if there is a significant difference between 1, 2, 3 then we should
try to exploit locality if we can.

I would expect that direct disk access would be much faster than access through a datanode, but there is no
hard data on that available. If data were available, then that would indicate if direct disk access (bypassing
the datanode) is useful or not. If so, that argues in favor of implementing direct disk access as has been
proposed, and if direct disk access were available, that would argue in favor of locality based region assignment.

As you point out, blocks migrate over time (especially if you are using the HDFS balancer), and that would
greatly complicate assignment of regions to region servers.

Suppose there was one 'hot' datanode that hosted blocks from many regions. Using locality might end up in
overloading the region server on that node, resulting in poorer performance.

There is a lot of performance evaluation that needs to be done before we actually take the step of using
locality-based region assignment. If doing that performance evaluation sounds interesting to you, I think
that would be a great GSOC project.

Before we try locality-based assignment, we need to have this analysis to see if the idea is worth pursuing.


stack added a comment - 29/Mar/09 09:47 PM
The going direct to local blocks reading is HADOOP-4801. In summary, the payoff short-circuiting the datanode is small, and yet to be seen – at least to date – and it seems doubtful that a second route to the data will be opened because of security concerns, etc. Thats my take on the issue (It could change of course).

I think that if we only made savings in network traffic, that'd be reason enough to implement locality algorithms. JK makes an interesting point above that we could manufacture hot datanodes if we blindly serve regions from a datanode that hosts all the data but this can happen now since we operate blindly and its only smart use of the locality info that will help damp hot spots.

Samuel, if still interested, have you made petition to become a GSOC student using this issue as your project? (Add in some of JKs notes on need to research what happens in a running cluster so know best what to implement).


Samuel Guo added a comment - 30/Mar/09 01:54 PM
Thanks for your comments, Jim.

> Solid performance data evaluating the cost of:
> 1) network access to a block in a different rack
> 2) network access to a block in the same rack but on a different server
> 3) network access to a block on the same server
> 4) direct disk access to a block on the same server
> would be highly useful. If there is little difference between 1, 2, 3 (access to a block through a datanode) then
> locality may not be useful. On the other hand, if there is a significant difference between 1, 2, 3 then we should
> try to exploit locality if we can.

> There is a lot of performance evaluation that needs to be done before we actually take the step of using
> locality-based region assignment. If doing that performance evaluation sounds interesting to you, I think
> that would be a great GSOC project.

Yes, I agree with you. We need to do a detail analysis of most behaviors of HDFS and HBase before we try locality-based assignment. And the analysis work will be the main part of my GSOC project.

> Suppose there was one 'hot' datanode that hosted blocks from many regions. Using locality might end up in
> overloading the region server on that node, resulting in poorer performance.

Yes, Locality should be taken carefully not to overload the region server or the data node. An ideal region assignment can assign regions close to its data to reduce network traffic while balancing the loads between region servers, datanodes and avoiding disk competition on the same datanode. As what you suggested, we need to know the following things clearly before making it.
1) what is the difference we access data from different locations(local, local by-pass, remote, remote rack)?
2) In regions' life time, what is the data-blocks' distribution? And how many bytes that the region reads data from local node? how many from remote? from remote rack?
3) After a balance operation happened in HDFS, how 2) changes?
4) After some region servers failed, how 2) changes?

I am not so clear now about how to analysis it. but I think I can take them one by one to make things clearly.


Samuel Guo added a comment - 30/Mar/09 01:56 PM
Thanks stack.

> Samuel, if still interested, have you made petition to become a GSOC student using this issue as your project? (Add in some of JKs notes on need
> to research what happens in a running cluster so know best what to implement).

Yes. I will add Jim's notes on my proposal.


stack added a comment - 24/Apr/09 10:50 PM
If this makes it in time for 0.20.0, good, but for now moving it out.