Issue Details (XML | Word | Printable)

Key: HADOOP-4116
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Blocker Blocker
Assignee: Hairong Kuang
Reporter: Raghu Angadi
Votes: 0
Watchers: 3
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Balancer should provide better resource management

Created: 08/Sep/08 04:23 PM   Updated: 08/Jul/09 04:43 PM
Return to search
Component/s: None
Affects Version/s: 0.17.0
Fix Version/s: 0.19.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works balancerRM.patch 2008-09-18 08:14 PM Hairong Kuang 20 kB
Text File Licensed for inclusion in ASF works balancerRM1.patch 2008-09-23 11:20 PM Hairong Kuang 21 kB
Text File Licensed for inclusion in ASF works balancerRM2-b18.patch 2008-09-24 09:51 PM Hairong Kuang 21 kB
Text File Licensed for inclusion in ASF works balancerRM2.patch 2008-09-24 10:13 PM Hairong Kuang 23 kB

Hadoop Flags: Incompatible change, Reviewed
Release Note: Changed DataNode protocol version without impact to clients other than to compel use of current version of client application.
Resolution Date: 25/Sep/08 08:28 PM


 Description  « Hide
The number of threads are currently limited on datanodes. Once these threads are occupied, DataNode does not accept any more requests (DOS). Recently we saw a case where most of the 256 threads were waiting in DataXceiver.replaceBlock() trying to acquire balancingSem. Since rebalancing is (heavily) throttled, I would think this would be the common case.

These operations waiting for active rebalancing threads to finish need not take up a thread.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Hairong Kuang added a comment - 12/Sep/08 06:28 PM
The more close investigation of the problem shows the balancer needs additional improvements:

(1) The balancer needs to better handle block move timeout well. Currently it simply assumes that
the timeouted move is failed but does not take the effort to make sure the move is interrupted and the resources the
move takes is released. The next phase of scheduling may schedule more blocks to move from the same DataNode thus using
more and more resources.

(2) Resource control for the balancing purpose at DataNodes should use a fair Semaphore. Currently
it uses an unfair Semaphore that makes no guarantees about the order in which threads acquire permits. A
thread invoking acquire() can be allocated a permit ahead of a thread that has been waiting. Therefore, if a dfs
cluster has many DataNodes that has a long queue of block move requests, it is very likely to enter the
following state: A thread in DataNode A holding a permit and asks DataNode B to receive a block, while DataNode B has a
thread holding a Semaphore and asking DataNode A to receive a block. Although the block move from B to A was scheduled
much later than the move from A to B, they may be executed simultaneously. Both block receives are blocks on acquiring
a permit assuming only one permit can be issued. Therefore, a deadlock occurs.


Raghu Angadi added a comment - 12/Sep/08 06:37 PM
Thanks for the detailed analysis Hairong. I filed the original jira just looking at one symptom.

Hairong Kuang added a comment - 13/Sep/08 12:09 AM
Proposed changes to the Balancer:
1. Remove the use of Semaphor at DataNodes. Instead a DataNode uses a counter to manages the number of concurrent block moves. On receiving a block move request while maximum block moves are in progress, reject the request immediately.
2. Let the receiver initiate the block move; The sender rejects the request when the maximum number has already reached. As a result when either the sender or the receiver does not have resource to handle block move, the block content will not get transfered across network.
3. The balancer does not set a timeout on a socket. Instead, it sets the option KeepAlive on the socket. So a block move does not timeout no matter how slow it goes and next phrase of scheduling does not get started when there is a pending block move.

Hairong Kuang added a comment - 18/Sep/08 08:14 PM
A patch for review.

Bo Shi added a comment - 19/Sep/08 10:54 PM
Hi Hairong, I'm interested in testing out your fix - is there, by any chance, a patch against the 0.18.x series?

I've only been able to partially apply your patch to to the official 0.18.1 release and it seems there was a major code reorg between 0.18 and 0.19 branches.


Konstantin Shvachko added a comment - 20/Sep/08 01:08 AM
  • I do not understand what was the reason for changing the push block model to the pull model.
    Previously balancer would send OP_COPY_BLOCK request to the proxy node, which then sent OP_REPLACE_BLOCK to the target.
    Now it is reversed: the balancer sends OP_REPLACE_BLOCK to the target, which sends OP_COPY_BLOCK to the proxy.
    In both cases DataXceiver is supposed to check the XceiverCount and throw an exception if it is exceeded.
    So in both cases the transfer will fail if any of the two data-nodes are busy.
    I don't see a mistake here, but don't see a reason for this rather radical change either, may be I am missing something.
  • BalanceManager
    • should probably be called BlockBalanceThrottler.
    • It makes sense to derive it from BlockTransferThrottler.
    • It should be a static class.
    • And it should rather be a member of DataXceiveServer than DataNode.

Hairong Kuang added a comment - 22/Sep/08 06:00 AM
In the push model, if the proxy is allowed to transfer the block but the destination does not allow to receive it. The whole/partial block will still be transferred to the destination's system buffer but then be thrown away. Although the block is not delivered to the target's datanode buffer, network resources are still wasted.

Bo, I will create a patch to 0.18 the first thing when I get to the office tomorrow.


Hairong Kuang added a comment - 23/Sep/08 05:12 PM
The patch incorporates Konstantin's comments.

Raghu Angadi added a comment - 23/Sep/08 05:37 PM
> 3. The balancer does not set a timeout on a socket. Instead, it sets the option KeepAlive on the socket. So a block move does not timeout no matter how slow it goes and next phrase of scheduling does not get started when there is a pending block move.

How does keep-alive matter? TCP has no inherent timeout for a connection in normal state.


Hairong Kuang added a comment - 23/Sep/08 06:02 PM
As I said, my intention is that receiveResponse never times out in normal state no matter how slow the other side is. Setting KeepAlive is for detecting the other side's machine gets crashed suddenly so it won't wait there forever. But for all other cases, it will return eventually. Does it make sense?

Raghu Angadi added a comment - 23/Sep/08 07:10 PM
Oh.. it is to detect if the other side has crashed. It make sense. The keepalive might be in the order of hours I guess.

Hairong Kuang added a comment - 23/Sep/08 08:12 PM
Yes, the default is 2 hours. My argument is that the performance of Balancer is not of importance comparing to using too many resources. Anyway the administrator can stop the balancer anytime if she sees that the balancer has not made any progress for hours.

Raghu Angadi added a comment - 23/Sep/08 09:34 PM
Yes. I think hours is fine in this case. thanks for the clarification.

Konstantin Shvachko added a comment - 24/Sep/08 12:50 AM
+1.

Steve Loughran added a comment - 24/Sep/08 09:26 AM
keepAlives are a fairly weak way of assessing "liveness" because
-it works at the network stack level, so your app may still be dead but the KA packets are happy
-if there are a lot of (idle) connections between two hosts, a lot of KA traffic can be generated, rather than one packet per host, which is how a lot of protocols (CORBA and DCOM, for example) communicate "we are still alive".

I think this proposal is better than nothing, but we need to be aware of limitations. It will detect a network partition, but not a hung far end if the network stack is still up


Hairong Kuang added a comment - 24/Sep/08 06:40 PM
I agree that KA does not detect dead apps on the other side. Besides timeouts, is there any good solution to this problem?

Balancer limits at most 5 concurrent connections from itself to a DataNode, so too many KAs should not be a concern.


Hairong Kuang added a comment - 24/Sep/08 08:35 PM
Here is a patch for branch 18.

Hairong Kuang added a comment - 24/Sep/08 09:51 PM
This is a patch to branch 18 that has fixed the findbug warning.

Hairong Kuang added a comment - 24/Sep/08 10:13 PM
Attach one patch for the trunk, which also has fixed the findbugs warning introduced by the previous trunk patch.

Hadoop QA added a comment - 25/Sep/08 01:16 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12390875/balancerRM2.patch
against trunk revision 698721.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/console

This message is automatically generated.


Steve Loughran added a comment - 25/Sep/08 11:08 AM
The ping operation we are proposing for the lifecycle in HADOOP-3628 and HADOOP-3969 does a better health check, as it asks the far end if it thinks it is happy, and can detect a dead end (with suitable timeouts) and a machine that thinks it is unwell. But the most reliable way to check system health is to give that node real work and see if it completes it within time. That's something that could be done as a low priority job across a cluster: queue work and check the results, though you'd need to direct the work to specific nodes somehow.

Hairong Kuang added a comment - 25/Sep/08 06:56 PM
Yes, I agree. I did similar stuff in HADOOP-2188 in the context of IPC. For this issue, I do not want to take the effort to add a Ping interface to DataNode. I will use KeepAlive for now.

The failed unit test seems not related to this jira.


Hairong Kuang added a comment - 25/Sep/08 08:28 PM
I've committed this.

Tsz Wo (Nicholas), SZE added a comment - 25/Sep/08 09:12 PM
It is not clear to me that the test (TestDatanodeDeath) failed is not related since the patch changed quite a lot Datanode codes. Created HADOOP-4278 for TestDatanodeDeath.

Hudson added a comment - 26/Sep/08 05:03 PM

Hudson added a comment - 29/Oct/08 02:39 PM
Integrated in Hadoop-trunk #646 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/646/)
Fix the change log to reflect that is reverted in 0.18.2

Raghu Angadi added a comment - 23/Feb/09 11:35 PM
Changed 'fix version' to 0.19.0 since 0.18 change was reverted later.