Issue Details (XML | Word | Printable)

Key: HADOOP-3673
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Blocker Blocker
Assignee: Tsz Wo (Nicholas), SZE
Reporter: dhruba borthakur
Votes: 0
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Deadlock in Datanode RPC servers

Created: 01/Jul/08 12:29 AM   Updated: 08/Jul/09 04:43 PM
Component/s: None
Affects Version/s: 0.18.0
Fix Version/s: 0.18.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 3673_20080702.patch 2008-07-02 08:05 PM Tsz Wo (Nicholas), SZE 4 kB
Text File Licensed for inclusion in ASF works 3673_20080702b.patch 2008-07-02 10:52 PM Tsz Wo (Nicholas), SZE 5 kB
Text File Licensed for inclusion in ASF works 3673_20080702c.patch 2008-07-02 11:26 PM Tsz Wo (Nicholas), SZE 7 kB
Text File Licensed for inclusion in ASF works 3673_20080702d.patch 2008-07-03 12:45 AM Tsz Wo (Nicholas), SZE 9 kB
Text File Licensed for inclusion in ASF works 3673_20080702e.patch 2008-07-03 01:06 AM Tsz Wo (Nicholas), SZE 9 kB
Text File Licensed for inclusion in ASF works 3673_20080707.patch 2008-07-07 06:12 PM Tsz Wo (Nicholas), SZE 9 kB
Text File Licensed for inclusion in ASF works 3673_20080707b.patch 2008-07-07 06:50 PM Tsz Wo (Nicholas), SZE 10 kB
Text File Licensed for inclusion in ASF works 3673_20080707b_0.18.patch 2008-07-08 06:53 PM Tsz Wo (Nicholas), SZE 10 kB
Issue Links:
Reference
 

Hadoop Flags: Reviewed
Resolution Date: 08/Jul/08 07:42 PM


 Description  « Hide
There is a deadlock scenario in the way Lease Recovery is triggered using the Datanode RPC server via HADOOP-3310.

Each Datanode has dfs.datanode.handler.count handler threads (default of 3). These handler threads are used to support the generation-stamp-dance protocol as described in HADOOP-1700.

Let me try to explain the scenario with an example. Suppose, a cluster has two datanodes. Also, let's assume that dfs.datanode.handler.count is set to 1. Suppose that there are two clients, each writing to a separate file with a replication factor of 2. Let's assume that both clients encounter an IO error and triggers the generation-stamp-dance protocol. The first client may invoke recoverBlock on the first datanode while the second client may invoke recoverBlock on the second datanode. Now, each of the datanode will try to make a getBlockMetaDataInfo() to the other datanode. But since each datanode has only 1 server handler threads, both threads will block for eternity. Deadlock!



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
dhruba borthakur added a comment - 01/Jul/08 12:32 AM
One solution would be to make the Datanode RPC serve spawn a new thread for every incoming RPC. This is acceptable because the RPCs that a datanode has to serve occur only when there is an IO error detected by a writer. It is not a very frequent operation.

Tsz Wo (Nicholas), SZE added a comment - 01/Jul/08 01:28 AM
Currently, there are two protocols, InterDatanodeProtocol and ClientDatanodeProtocol, supported by a single RPC server. Either one of the protocols won't introduce deadlock. So we have two more solutions as follows:
  1. run two rpc servers for these two protocols
  2. spawn new threads for one of the protocols (I think it is good to spawn a new thread for each call of ClientDatanodeProtocol.recoverBlock(...)).

Tsz Wo (Nicholas), SZE added a comment - 01/Jul/08 06:02 PM
One easy get-around is to set dfs.datanode.handler.count to a large number, say 20. Then the probability of hitting deadlock will be very low.

Raghu Angadi added a comment - 01/Jul/08 06:31 PM
#1 above with 2 rpc servers would work.
#2 Not sure how spawning new thread would help since there is no way to relinquish an RPC handler thread inside an RPC.

A third option would be to let the client talk to the datanodes before calling "recoverBlock()", so that we don't invoke an RPC from another RPC.


Raghu Angadi added a comment - 01/Jul/08 06:56 PM
Since adding a new RPC server is problematic because of new port required, I would vote for restructuring recoverBlock() a bit as mentioned above.

Tsz Wo (Nicholas), SZE added a comment - 01/Jul/08 07:01 PM
> A third option would be to let the client talk to the datanodes before calling "recoverBlock()", so that we don't invoke an RPC from another RPC.

It might not be possible since recoverBlock() calls both getBlockMetaDataInfo(...) and updateBlock(...).


Tsz Wo (Nicholas), SZE added a comment - 01/Jul/08 07:11 PM - edited
The third option would be DFSClient calling LeaseManager.recoverBlock(...) directly, i.e. no rpc call. Then, the entire ClientDatanodeProtocol can be removed.

dhruba borthakur added a comment - 01/Jul/08 08:28 PM
I think it is good from an architecture point of view to keep the client relatively lightweight as much as possible. This will allow us to port the client to many other languages. So, i would like to persist with the current design (of making the primary datanode invoke recoverBlock).

What if the RPC.Server can me made to support dynamic handler threads? For example, when the Datanode creates a RPC Server, it can specify a handler count of 0. The RPC Server code will interpret a count of 0 to mean that the application wants to create a new thread to service each and every RPC request. In this case, the RPC Server code will internally create a single handler thread for this Server. This special handler thread will invoke callQueue.take() to retrieve a new incoming call, and then fork off a new thread to process this call. Care has to be taken to ensure that responses from calls from the same connection are sequentialized and processed in order.


Raghu Angadi added a comment - 01/Jul/08 09:27 PM
A thin client is better. And DFS client has never gotten thinner in the past. The protocol needs to be really really stable before there is such a client. Regd porting, mostly likely a combination of a very thin wrapper over a some what fatter (Java?) library is what might emerge in the future.

For 0.18, simple and straight forward fix is better. RPC server feature might not be too intrusive for 0.18 either.

> Care has to be taken to ensure that responses from calls from the same connection are sequentialized and processed in order.
No ordering is required.


Tsz Wo (Nicholas), SZE added a comment - 01/Jul/08 10:17 PM
> No ordering is required.

That's true since RPC are blocking calls.


Tsz Wo (Nicholas), SZE added a comment - 01/Jul/08 10:33 PM
Raghu pointed out that since the primary datanode makes RPC calls to itself, it is easy to hit a deadlock. If we modify the codes such that it does normal method call (instead of RPC) to itself, the deadlock situation will be much unlikely.

Hairong Kuang added a comment - 02/Jul/08 12:13 AM
> The third option would be DFSClient calling LeaseManager.recoverBlock(...) directly, i.e. no rpc call.
I vote for this option. This is a simple solution and solves the problem completely. It is cleaner too because it remoes ClientDatanodeProtocol. I do not think that it makes the client a lot fatter. The client is already very fat. Making a client lightweight is not the reason to ask datanodes to behave as a client.

Sanjay Radia added a comment - 02/Jul/08 12:54 AM
The problems with the third option is that it will not work when we extend HDFS to support multiple appenders; in this case recoverBlock should not happen on the client side but on the DataNode side so that a single recovery action is performed for all appenders.

Tsz Wo (Nicholas), SZE added a comment - 02/Jul/08 01:00 AM
Fourth option: add a counter in each datanode to limit the number of concurrent recoverBlock(...) RPC calls, so that the number of concurrent recoverBlock(...) RPC calls at anytime is less then the number of RPC handlers. Then, there is no more deadlock.

Tsz Wo (Nicholas), SZE added a comment - 02/Jul/08 01:12 AM
If Dhruba's solution "spawn a new thread for every incoming RPC" is not hard to implement, it seems to be the best.

Sameer Paranjpye added a comment - 02/Jul/08 02:31 PM
Is there any reason that the primary datanode alone can do recovery? If not, then this becomes classic Dining Philosophers. The client always invokes recoverBlock() on the lowest (by some sort order) datanode in it's pipeline. This 'recovery primary' in turn invokes other nodes in the pipeline in the same sort order. We'd still have to remove the datanodes RPC to itself.

Hairong Kuang added a comment - 02/Jul/08 05:58 PM
I do not like the idea of spawning a new thread for every incoming RPC. A RPC server does need to limit the amount of resource it uses. What if there are thousands of blocks that need to be recovered concurrently?

Sameer's idea seems simple but works. This means that all datanodes need to have a global order, for example the alphabetical order of it name. A client simply sorts the datanodes in its pipeline before contacting the primary datanode.


Arun C Murthy added a comment - 02/Jul/08 06:10 PM
Is this the same as HADOOP-3657?

Raghu Angadi added a comment - 02/Jul/08 06:24 PM
> Is this the same as HADOOP-3657?
I am 99% certain it is. Please let me know when you hit HADOOP-3657 again so that we can confirm.

Sanjay Radia added a comment - 02/Jul/08 06:52 PM
Sameer is arguing that there is one additional condition under which the deadlock can occur and that ensuring that all recovery occurs in specific orders (e.g. by id of the datanode) avoids this additional deadlock situation.

Tsz Wo (Nicholas), SZE added a comment - 02/Jul/08 07:34 PM
I am implementing Sameer's solution.

Tsz Wo (Nicholas), SZE added a comment - 02/Jul/08 08:05 PM
3673_20080702.patch:
  • Pick the "least" datanode as the primary datanode
  • Do not calling itself with RPC
  • Increase default dfs.datanode.handler.count from 3 to 10

Sameer Paranjpye added a comment - 02/Jul/08 08:31 PM
How are the datanodes sorted? By name, ip or something else? I'd personally prefer sorting by ip:port so that we're protected from name resolution vagaries. Also (this is a nit), why sort when all we need to do is find the lowest Datanode.

Does the number of handlers need to be in hadoop-default.xml? Maybe it should be a 'private' config variable, used for debugging with a reasonable hardcoded default.


Tsz Wo (Nicholas), SZE added a comment - 02/Jul/08 08:49 PM
> How are the datanodes sorted? By name, ip or something else? I'd personally prefer sorting by ip:port so that we're protected from name resolution vagaries.

Datanodes are sorted by ip:port.

> Also (this is a nit), why sort when all we need to do is find the lowest Datanode.

Since the array is small (it depends on replication factor, default is 3), sort and find min is essentially the same. Sort is more convenient in this case.

> Does the number of handlers need to be in hadoop-default.xml?

Handler counts link dfs.namenode.handler.count and mapred.job.tracker.handler.count are in hadoop-default.xml. We probably should do the same for dfs.datanode.handler.count.


Raghu Angadi added a comment - 02/Jul/08 09:00 PM
Why do we want to increase the number of handlers in this patch? Keeping it at 3 would give us better confidence that problem is really fixed. 3 handlers is enough since there are not that many RPCs to datanodes.

Sameer Paranjpye added a comment - 02/Jul/08 10:35 PM
Note that while the lowest Datanode should be orchestrating block recovery the Datanode order for data should not change. Data should go continue to the closest node first, if not it could make unnecessary cross rack trips.

Tsz Wo (Nicholas), SZE added a comment - 02/Jul/08 10:52 PM
3673_20080702b.patch:
  • Use min instead of sort. The array ordering is preserved.
  • Keeping default dfs.namenode.handler.count to 3

Tsz Wo (Nicholas), SZE added a comment - 02/Jul/08 11:26 PM
3673_20080702c.patch: the previous patch cannot be compiled. Removed some un-used codes in FSImage.

Raghu Angadi added a comment - 03/Jul/08 12:30 AM
+1. patch looks good. Manual test to verify this works will be good.

Tsz Wo (Nicholas), SZE added a comment - 03/Jul/08 12:45 AM
3673_20080702d.patch: added a test

Tsz Wo (Nicholas), SZE added a comment - 03/Jul/08 12:53 AM
Passed all tests locally. Try Hudson ...

Tsz Wo (Nicholas), SZE added a comment - 03/Jul/08 01:06 AM
3673_20080702e.patch: changed a few comments/output messages in the test

Hadoop QA added a comment - 03/Jul/08 03:25 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385160/3673_20080702e.patch
against trunk revision 673517.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2788/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2788/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2788/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2788/console

This message is automatically generated.


dhruba borthakur added a comment - 03/Jul/08 11:02 PM
+1 code looks good. The test case that you added should already be covered by TestDatanodeDeath, isn't it?

Tsz Wo (Nicholas), SZE added a comment - 03/Jul/08 11:11 PM
I set dfs.datanode.handler.count to 1in the new test. So it also tests if there is a deadlock. Without the patch, the new test will fail.

dhruba borthakur added a comment - 03/Jul/08 11:35 PM
Ok, got it. thanks.

Raghu Angadi added a comment - 03/Jul/08 11:44 PM
> Without the patch, the new test will fail.
The test will pass even if patch does not pick the "min" datanode, right? It mainly makes sure that a datanode does not invoke RPC to itself. A manual test with multiple handlers (2 or 3) might still be useful.

Raghu Angadi added a comment - 04/Jul/08 12:01 AM
I didn't mean to delay committing the patch. The patch can be checked in now.

Tsz Wo (Nicholas), SZE added a comment - 04/Jul/08 12:06 AM
> The patch can be checked in now.

The patch does not apply anymore due to HADOOP-2885. Anyway, you make a point, I will modify the test to start multiple slow-writers.


Tsz Wo (Nicholas), SZE added a comment - 07/Jul/08 06:12 PM
3673_20080707.patch: updated with trunk, not changed the test yet.

Tsz Wo (Nicholas), SZE added a comment - 07/Jul/08 06:50 PM
3673_20080707b.patch: updated that tests to start 10 concurrent slow writers.

Tsz Wo (Nicholas), SZE added a comment - 07/Jul/08 08:43 PM
Passed all tests locally. Try Hudson again.

Hadoop QA added a comment - 08/Jul/08 05:20 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385432/3673_20080707b.patch
against trunk revision 674834.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2813/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2813/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2813/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2813/console

This message is automatically generated.


Tsz Wo (Nicholas), SZE added a comment - 08/Jul/08 06:53 PM
3673_20080707b_0.18.patch: for 0.18

Raghu Angadi added a comment - 08/Jul/08 07:42 PM
I just committed this. Thanks Nicholas!

Hudson added a comment - 22/Aug/08 12:34 PM