Issue Details (XML | Word | Printable)

Key: HADOOP-3620
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Hairong Kuang
Reporter: Hairong Kuang
Votes: 0
Watchers: 4
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Namenode should synchronously resolve a datanode's network location when the datanode registers

Created: 23/Jun/08 06:17 PM   Updated: 08/Jul/09 04:43 PM
Return to search
Component/s: None
Affects Version/s: 0.18.0
Fix Version/s: 0.19.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works netResolution.patch 2008-06-24 12:16 AM Hairong Kuang 17 kB
Text File Licensed for inclusion in ASF works netResolution1.patch 2008-06-26 12:33 AM Hairong Kuang 23 kB
Text File Licensed for inclusion in ASF works netResolution2.patch 2008-06-30 10:59 PM Hairong Kuang 22 kB
Text File Licensed for inclusion in ASF works netResolution3.patch 2008-07-09 11:31 PM Hairong Kuang 27 kB
Text File Licensed for inclusion in ASF works netResolution4.patch 2008-07-18 12:49 AM Hairong Kuang 27 kB
Text File Licensed for inclusion in ASF works netResolution5.patch 2008-07-23 11:58 PM Hairong Kuang 32 kB
Text File Licensed for inclusion in ASF works netResolution6.patch 2008-07-29 12:34 AM Hairong Kuang 33 kB
Issue Links:
Incorporates
 

Hadoop Flags: Reviewed
Resolution Date: 04/Aug/08 11:34 PM


 Description  « Hide
Release 0.18.0 removes the rpc timeout. So the namenode is ok to resolve a datanode's network location when the datanode registers. This could remove quite a lot of unnecessary code in both datanode and namenode to handle asynchronous network location resolution and avoid many potential bugs.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
dhruba borthakur added a comment - 23/Jun/08 06:49 PM
It would be nice if the code can be organized in such a way that the FSnamesystem global lock is not held when the datanode's network location is resolved. Otherwise, cluster restart times could potentially take much much longer.

Hairong Kuang added a comment - 24/Jun/08 12:16 AM - edited
Dhruba's comment makes sense. The attached inital patch still let the thread hold the global lock while resolving a network location. I need to figure out how not to hold the lock without any risk of data structure inconsistency.

The attached patch resolves a data node's network location when it registers. It also lets a data node to randomly back off its block report when the data node starts up.


Amar Kamat added a comment - 24/Jun/08 06:16 AM
I think it also makes sense to do the same in the JobTracker. There too the resolution is async. With HADOOP-3590 getting fixed, the problem is that the tasks will be scheduled randomly until the node gets resolved. Some of the test cases assume that the all the TTs are resolved before the job gets submitted which might not be true always. Thoughts?

Owen O'Malley added a comment - 24/Jun/08 06:23 AM
I'm +1 to doing all of the resolution synchronously in both the namenode and jobtracker.

Hairong Kuang added a comment - 26/Jun/08 12:33 AM
This patch additionally improves the performance of network resolution during registration by prersolving the network locations of every included host and storing them in a cache.

Devaraj Das added a comment - 27/Jun/08 12:51 PM
I haven't gone through the patch in detail but one thing which struck me is that you removed the cache from ScriptBasedMapping and moved it to the dfs part of the framework. That might be a problem for MR part of the framework.

Hairong Kuang added a comment - 30/Jun/08 10:59 PM
Ok, this patch still keeps the cache in SriptBasedMapping. But pre-resolve datanodes' network locations only when script based mapping is used.

Hairong Kuang added a comment - 09/Jul/08 11:31 PM
This patch applies to trunk. Additionally
1. It provides a cached implementation of DNSToSwitchMapping, which caches all resolved DNS to switch mapping. Any implementation could extend CachedDNSToSwitchMapping if this helps improve its performance.
2. ScriptBasedMapping is CachedDNSToSwitchMapping while RawScriptBasedMapping resolves a location only by running a script.
3. Name node pre-resolves the network location of all datanodes in the include list only if the configured DNSToSwitchMapping is cached.

Devaraj Das added a comment - 10/Jul/08 11:27 AM
As far as I can see, the resolution happens with the global FSNamesystem lock held. Is that true? Dhruba had a concern there and I don't see a nice way to handle that either. The patch looks good otherwise (subject to hudson passing it and successfully running MapReduce jobs with this patch).

Hairong Kuang added a comment - 11/Jul/08 12:54 AM
It's very hard to release the global lock during network resolution without breaking the data structure consistency. So what I did is to pre-resolve the network locations of datanodes that are in the include file. So the resolution during the registration is going to be fast.

Hadoop QA added a comment - 13/Jul/08 01:56 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385685/netResolution3.patch
against trunk revision 676069.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

-1 javadoc. The javadoc tool appears to have generated 1 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2850/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2850/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2850/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2850/console

This message is automatically generated.


Hairong Kuang added a comment - 18/Jul/08 12:49 AM
This patch fixed the javadoc warning.

Raghu Angadi added a comment - 18/Jul/08 01:16 AM - edited
One way to move the resolution out of global lock :

resolveNetworkLocation does not strictly need DatanodeDescriptor.
So regeisterDatanode() could look something like :

public void registerDatanode(DatanodeRegistration nodeReg) throws IOException {
     String networkLocation = resolveNeworklocation(nodeReg.getHost());
     internalRegisterDatanode(nodeReg, networkLocation); //holds global lock.
}

Does the above work?

Note that pre-resolving hosts in include file might not help start up since

  • resolving serially at the beginning still increases the start up time.
  • "normalizing host names" does multiple DNS resolves.

Hadoop QA added a comment - 18/Jul/08 09:45 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386366/netResolution4.patch
against trunk revision 677839.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2899/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2899/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2899/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2899/console

This message is automatically generated.


Hairong Kuang added a comment - 21/Jul/08 06:16 PM
> Does the above work?
I thought about this solution too. But if you look at the registration code, sometimes there is no need to resolve a node's network location if the node has already registered. So network resolution in the front could be an overhead.

> Note that pre-resolving hosts in include file might not help start up.
Pre-resolving should help since it resolves network locations in batch and therefore reducing the number of calls to the rack script dramatically.


Raghu Angadi added a comment - 21/Jul/08 06:40 PM - edited
> So network resolution in the front could be an overhead.
this may not be a problem since a DataNode would not re-register unless there is a real problem/bug. Not sure if we need to optimize that. Even if we want to, we can make 'internalRegisterDatanode()' throw an exception to indicate that netwo needs to be resolved before calling it.

I think doing this way will simplify the code and patch even further.

> Pre-resolving should help since it resolves network locations in batch and therefore reducing the number of calls to the rack script dramatically.
Only if the script can do the resolutions in parallel. Does the default script make use of this? Also there are 2 DNS resolutions done for each host serially inside namenode to 'normalize' the host names, right? In addition, many installations may not specify include hosts.


Hairong Kuang added a comment - 23/Jul/08 11:58 PM
I talked with Raghu and understood his concern is that the number of calls to DNS resolution might impact the performance of network location resolution performance. From my experiment, this seems not a big concern. Instead, reducing the number of calls to the script would greatly improve the resolution performance.

But this new patch reduces all possible calls to DNS resolution. It has all the following changes:
1. Increase maxArg of ScriptBasedMapping from 20 to 100;
2. CachedDNSToSwitchMap maps IP addresses to the network location;
3. It allows include/exclude host files to contain ip addresses.


Hadoop QA added a comment - 24/Jul/08 01:49 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386770/netResolution5.patch
against trunk revision 679286.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2936/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2936/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2936/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2936/console

This message is automatically generated.


Hairong Kuang added a comment - 29/Jul/08 12:34 AM
Fixed the failed unit tests.

Hadoop QA added a comment - 29/Jul/08 02:49 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387066/netResolution6.patch
against trunk revision 680577.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2965/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2965/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2965/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2965/console

This message is automatically generated.


Raghu Angadi added a comment - 29/Jul/08 10:50 PM

But this new patch reduces all possible calls to DNS resolution. It has all the following changes: [...]

+1 for the improvements.

I talked with Raghu and understood his concern is that the number of calls to DNS resolution might impact the performance of network location resolution performance. From my experiment, this seems not a big concern.

But, I don't think DNS resultions is non-issue. see HADOOP-3694 for e.g. Its good that the latest patch reduces DNS resolutions.


Devaraj Das added a comment - 31/Jul/08 12:37 PM
+1

Hairong Kuang added a comment - 04/Aug/08 11:35 PM
I've just committed this.

Amareshwari Sriramadasu added a comment - 05/Aug/08 04:33 AM
I noticed that the new files are not committed. Trunk doesnt compile

Hudson added a comment - 22/Aug/08 12:34 PM