|
[
Permlink
| « Hide
]
dhruba borthakur added a comment - 01/Jul/08 12:32 AM
One solution would be to make the Datanode RPC serve spawn a new thread for every incoming RPC. This is acceptable because the RPCs that a datanode has to serve occur only when there is an IO error detected by a writer. It is not a very frequent operation.
Currently, there are two protocols, InterDatanodeProtocol and ClientDatanodeProtocol, supported by a single RPC server. Either one of the protocols won't introduce deadlock. So we have two more solutions as follows:
One easy get-around is to set dfs.datanode.handler.count to a large number, say 20. Then the probability of hitting deadlock will be very low.
#1 above with 2 rpc servers would work.
#2 Not sure how spawning new thread would help since there is no way to relinquish an RPC handler thread inside an RPC. A third option would be to let the client talk to the datanodes before calling "recoverBlock()", so that we don't invoke an RPC from another RPC. Since adding a new RPC server is problematic because of new port required, I would vote for restructuring recoverBlock() a bit as mentioned above.
> A third option would be to let the client talk to the datanodes before calling "recoverBlock()", so that we don't invoke an RPC from another RPC.
It might not be possible since recoverBlock() calls both getBlockMetaDataInfo(...) and updateBlock(...). The third option would be DFSClient calling LeaseManager.recoverBlock(...) directly, i.e. no rpc call. Then, the entire ClientDatanodeProtocol can be removed.
I think it is good from an architecture point of view to keep the client relatively lightweight as much as possible. This will allow us to port the client to many other languages. So, i would like to persist with the current design (of making the primary datanode invoke recoverBlock).
What if the RPC.Server can me made to support dynamic handler threads? For example, when the Datanode creates a RPC Server, it can specify a handler count of 0. The RPC Server code will interpret a count of 0 to mean that the application wants to create a new thread to service each and every RPC request. In this case, the RPC Server code will internally create a single handler thread for this Server. This special handler thread will invoke callQueue.take() to retrieve a new incoming call, and then fork off a new thread to process this call. Care has to be taken to ensure that responses from calls from the same connection are sequentialized and processed in order. A thin client is better. And DFS client has never gotten thinner in the past. The protocol needs to be really really stable before there is such a client. Regd porting, mostly likely a combination of a very thin wrapper over a some what fatter (Java?) library is what might emerge in the future.
For 0.18, simple and straight forward fix is better. RPC server feature might not be too intrusive for 0.18 either. > Care has to be taken to ensure that responses from calls from the same connection are sequentialized and processed in order. > No ordering is required.
That's true since RPC are blocking calls. Raghu pointed out that since the primary datanode makes RPC calls to itself, it is easy to hit a deadlock. If we modify the codes such that it does normal method call (instead of RPC) to itself, the deadlock situation will be much unlikely.
> The third option would be DFSClient calling LeaseManager.recoverBlock(...) directly, i.e. no rpc call.
I vote for this option. This is a simple solution and solves the problem completely. It is cleaner too because it remoes ClientDatanodeProtocol. I do not think that it makes the client a lot fatter. The client is already very fat. Making a client lightweight is not the reason to ask datanodes to behave as a client. The problems with the third option is that it will not work when we extend HDFS to support multiple appenders; in this case recoverBlock should not happen on the client side but on the DataNode side so that a single recovery action is performed for all appenders.
Fourth option: add a counter in each datanode to limit the number of concurrent recoverBlock(...) RPC calls, so that the number of concurrent recoverBlock(...) RPC calls at anytime is less then the number of RPC handlers. Then, there is no more deadlock.
If Dhruba's solution "spawn a new thread for every incoming RPC" is not hard to implement, it seems to be the best.
Is there any reason that the primary datanode alone can do recovery? If not, then this becomes classic Dining Philosophers. The client always invokes recoverBlock() on the lowest (by some sort order) datanode in it's pipeline. This 'recovery primary' in turn invokes other nodes in the pipeline in the same sort order. We'd still have to remove the datanodes RPC to itself.
I do not like the idea of spawning a new thread for every incoming RPC. A RPC server does need to limit the amount of resource it uses. What if there are thousands of blocks that need to be recovered concurrently?
Sameer's idea seems simple but works. This means that all datanodes need to have a global order, for example the alphabetical order of it name. A client simply sorts the datanodes in its pipeline before contacting the primary datanode. Is this the same as
> Is this the same as
I am 99% certain it is. Please let me know when you hit Sameer is arguing that there is one additional condition under which the deadlock can occur and that ensuring that all recovery occurs in specific orders (e.g. by id of the datanode) avoids this additional deadlock situation.
I am implementing Sameer's solution.
3673_20080702.patch:
How are the datanodes sorted? By name, ip or something else? I'd personally prefer sorting by ip:port so that we're protected from name resolution vagaries. Also (this is a nit), why sort when all we need to do is find the lowest Datanode.
Does the number of handlers need to be in hadoop-default.xml? Maybe it should be a 'private' config variable, used for debugging with a reasonable hardcoded default. > How are the datanodes sorted? By name, ip or something else? I'd personally prefer sorting by ip:port so that we're protected from name resolution vagaries.
Datanodes are sorted by ip:port. > Also (this is a nit), why sort when all we need to do is find the lowest Datanode. Since the array is small (it depends on replication factor, default is 3), sort and find min is essentially the same. Sort is more convenient in this case. > Does the number of handlers need to be in hadoop-default.xml? Handler counts link dfs.namenode.handler.count and mapred.job.tracker.handler.count are in hadoop-default.xml. We probably should do the same for dfs.datanode.handler.count. Why do we want to increase the number of handlers in this patch? Keeping it at 3 would give us better confidence that problem is really fixed. 3 handlers is enough since there are not that many RPCs to datanodes.
Note that while the lowest Datanode should be orchestrating block recovery the Datanode order for data should not change. Data should go continue to the closest node first, if not it could make unnecessary cross rack trips.
3673_20080702b.patch:
3673_20080702c.patch: the previous patch cannot be compiled. Removed some un-used codes in FSImage.
+1. patch looks good. Manual test to verify this works will be good.
3673_20080702d.patch: added a test
Passed all tests locally. Try Hudson ...
3673_20080702e.patch: changed a few comments/output messages in the test
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385160/3673_20080702e.patch against trunk revision 673517. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2788/testReport/ This message is automatically generated. +1 code looks good. The test case that you added should already be covered by TestDatanodeDeath, isn't it?
I set dfs.datanode.handler.count to 1in the new test. So it also tests if there is a deadlock. Without the patch, the new test will fail.
> Without the patch, the new test will fail.
The test will pass even if patch does not pick the "min" datanode, right? It mainly makes sure that a datanode does not invoke RPC to itself. A manual test with multiple handlers (2 or 3) might still be useful. I didn't mean to delay committing the patch. The patch can be checked in now.
> The patch can be checked in now.
The patch does not apply anymore due to HADOOP-2885. Anyway, you make a point, I will modify the test to start multiple slow-writers. 3673_20080707.patch: updated with trunk, not changed the test yet.
3673_20080707b.patch: updated that tests to start 10 concurrent slow writers.
Passed all tests locally. Try Hudson again.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385432/3673_20080707b.patch against trunk revision 674834. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2813/testReport/ This message is automatically generated. 3673_20080707b_0.18.patch: for 0.18
I just committed this. Thanks Nicholas!
Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||