|
[
Permlink
| « Hide
]
dhruba borthakur added a comment - 23/Jun/08 06:49 PM
It would be nice if the code can be organized in such a way that the FSnamesystem global lock is not held when the datanode's network location is resolved. Otherwise, cluster restart times could potentially take much much longer.
Dhruba's comment makes sense. The attached inital patch still let the thread hold the global lock while resolving a network location. I need to figure out how not to hold the lock without any risk of data structure inconsistency.
The attached patch resolves a data node's network location when it registers. It also lets a data node to randomly back off its block report when the data node starts up. I think it also makes sense to do the same in the JobTracker. There too the resolution is async. With
I'm +1 to doing all of the resolution synchronously in both the namenode and jobtracker.
This patch additionally improves the performance of network resolution during registration by prersolving the network locations of every included host and storing them in a cache.
I haven't gone through the patch in detail but one thing which struck me is that you removed the cache from ScriptBasedMapping and moved it to the dfs part of the framework. That might be a problem for MR part of the framework.
Ok, this patch still keeps the cache in SriptBasedMapping. But pre-resolve datanodes' network locations only when script based mapping is used.
This patch applies to trunk. Additionally
1. It provides a cached implementation of DNSToSwitchMapping, which caches all resolved DNS to switch mapping. Any implementation could extend CachedDNSToSwitchMapping if this helps improve its performance. 2. ScriptBasedMapping is CachedDNSToSwitchMapping while RawScriptBasedMapping resolves a location only by running a script. 3. Name node pre-resolves the network location of all datanodes in the include list only if the configured DNSToSwitchMapping is cached. As far as I can see, the resolution happens with the global FSNamesystem lock held. Is that true? Dhruba had a concern there and I don't see a nice way to handle that either. The patch looks good otherwise (subject to hudson passing it and successfully running MapReduce jobs with this patch).
It's very hard to release the global lock during network resolution without breaking the data structure consistency. So what I did is to pre-resolve the network locations of datanodes that are in the include file. So the resolution during the registration is going to be fast.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385685/netResolution3.patch against trunk revision 676069. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2850/testReport/ This message is automatically generated. This patch fixed the javadoc warning.
One way to move the resolution out of global lock :
resolveNetworkLocation does not strictly need DatanodeDescriptor. public void registerDatanode(DatanodeRegistration nodeReg) throws IOException { String networkLocation = resolveNeworklocation(nodeReg.getHost()); internalRegisterDatanode(nodeReg, networkLocation); //holds global lock. } Does the above work? Note that pre-resolving hosts in include file might not help start up since
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386366/netResolution4.patch against trunk revision 677839. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2899/testReport/ This message is automatically generated. > Does the above work?
I thought about this solution too. But if you look at the registration code, sometimes there is no need to resolve a node's network location if the node has already registered. So network resolution in the front could be an overhead. > Note that pre-resolving hosts in include file might not help start up. > So network resolution in the front could be an overhead.
this may not be a problem since a DataNode would not re-register unless there is a real problem/bug. Not sure if we need to optimize that. Even if we want to, we can make 'internalRegisterDatanode()' throw an exception to indicate that netwo needs to be resolved before calling it. I think doing this way will simplify the code and patch even further. > Pre-resolving should help since it resolves network locations in batch and therefore reducing the number of calls to the rack script dramatically. I talked with Raghu and understood his concern is that the number of calls to DNS resolution might impact the performance of network location resolution performance. From my experiment, this seems not a big concern. Instead, reducing the number of calls to the script would greatly improve the resolution performance.
But this new patch reduces all possible calls to DNS resolution. It has all the following changes: -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386770/netResolution5.patch against trunk revision 679286. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2936/testReport/ This message is automatically generated. Fixed the failed unit tests.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387066/netResolution6.patch against trunk revision 680577. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2965/testReport/ This message is automatically generated.
+1 for the improvements.
But, I don't think DNS resultions is non-issue. see I noticed that the new files are not committed. Trunk doesnt compile
Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||