|
Some thoughts -
1) Make the DNS->Switch mapping an interface class. 1.1) interface DNStoSwitchMap { public String resolve(String dnsname); } 1.2) The switch string format is the same as it exists today (documented in https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf 1.3) The default implementation of the interface, packaged with hadoop, could simply look up a table of dns->switch mapping created statically. 2) The DataNode, today, takes the location as an argument. This is not needed anymore, and hence the associated code would go away. 4) The JobTracker creates the NetworkTopology for the TaskTrackers exactly how the NameNode does it. In our environment, task placement based on "distance" (o.a.h.n.NetworkTopology.getDistance), isn't that much relevant since we only have flat racks of machines. But we might make the framework ready for it as well (assuming it is not too much work). Does the above make sense? Just a few notes.
From an ops perspective, it is important that this mapping be highly pluggable in an easy way. The ability to have hadoop call some sort of executable (not necessarily a script) means we can do fancy things with /etc/netmasks or LDAP lookups or ... . Ideally, every sort of mapping would have a callout rather than having one big one. KISS is important here. [Remember, most admins--myself included--are not hardcore Java people. ] FWIW, most implementations of autofs include similar functionality called executable maps where the key is passed to an exec and the exec returns the location of the mount. So the practice has at least a little bit of traction. [In fact, auto.net aka /net on Linux uses this method.] Additionally,I think moving this functionality to be done on the namenode makes this significantly easier to manage as a grid scales up. There is also the issue of should the namenode 'trust' the datanode to report the proper location. I understand that the datanode and namenode have to trust each other at some point during node bringup, but I think it makes a lot of sense to let the namenode be in charge of data locality. Hopefuly this was helpful. I agree that an exec as a simple to config option should be required.
Currently in dfs a datanode can get its network location either from the command line or by running a pluggable script at the startup time. The property is defined in the default configuration file as below.
<property> Attached is one version of the patch. Hasn't been tested on large clusters yet. Here are the main changes:
1) DFS part updated to use the newly defined DNSToSwitchMapping interface. 1.1) The datanode doesn't send the switch info as part of registration, rather, the namenode gets that info through the DNSToSwitchMapping.resolve 2) The default implementation of the DNSToSwitchMapping assumes a script based resolution (ScriptBasedMapping). If the script is defined, then DEFAULT_RACK is assumed. 3) The JobTracker maintains the network topology and updates it (if required) whenever a tasktracker sends a heartbeat. 4) The JobInProgress maintains a compile-time-configurable number of task caches. For e.g., the first level cache is the map of leaf level Nodes to TIPs, the second level is the map of the rack level nodes to TIPs in that rack, the third level is for the level above rack and so on.. The default number of caches here is hardcoded to 2. 5) At runtime, the findNewTask would use these caches to assign a task to a checked-in tasktracker. Patch up for review. Attached is a tested patch. The patch has better documentation too. One of the important changes in the patch to do with testcases is the way it handles multiple datanodes in the same machine. Since the namenode should be able to distinguish between them in terms of the dnsToRackId mapping, I added a configuration option called "slave.host.name" that would take effect only when the framework is run under junit. Ditto applies to the jobtracker/tasktrackers. Also all communications to these dummy hostnames are redirected to "localhost" (indirectly via NetUtils.createSocketAddress).
Patch is up for review. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12371666/1985.v1.patch against trunk revision r604058. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs -1. The patch appears to introduce 2 new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests -1. The patch failed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1344/testReport/ This message is automatically generated. +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12371728/1985.v2.patch against trunk revision r604451. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1356/testReport/ This message is automatically generated. Fixed an issue in the Scan benchmark.
I ran the Scan benchmark attached in the patch (the benchmark just scans inputs; no sort/shuffle/reduce).
The input data was generated on a cluster of ~300 machines. Randomwriter with the following config was run - 40 maps per host, each map configured to generate 1G, dfs blk size 256 MB. The input data set was thus around 11.6 TB. Another cluster of ~900 nodes, with its dfs pointing to the earlier 300 node cluster, was used to run the Scan benchmark. The number of maps was equal to the number of dfs blocks in the input. The two clusters had common racks but no common nodes. With the rack aware patch, the scan program took 25 minutes (with 90% rack-local tasks), and without the patch, the scan took around 35 minutes, ~30% improvement. Cancelling patch to get the latest patch through hudson
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12372415/1985.v4.patch against trunk revision . @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests -1. The patch failed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1454/testReport/ This message is automatically generated. Fixed a problem with NetworkTopology's getNode method.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12372631/1985.v5.patch against trunk revision . @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests -1. The patch failed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1503/testReport/ This message is automatically generated. The patch is out-of-sync with the trunk.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12373076/1985.v6.patch against trunk revision r611760. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests -1. The patch failed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1581/testReport/ This message is automatically generated. I'm worried about the time and memory performance of this. Have you run a sort with dfs cluster == map/reduce cluster and compared running times and job tracker memory size? We've already seen cases where the current pollForNewTask causes performance problems...
It bothers me that the max levels is hard coded rather than configurable. From a style point of view, I probably would have defined a new class rather than use nested java.utils containers like List<Map<Node, List<TaskInProgress>>>. That way if we change the representation later it won't be scattered through the code. In particular, I can imagine wanting to have the data structure be something like: Did you need to change the definition of findNewTask? I don't see it in the patch. This needs user documentation in forrest. The java doc on DNSToSwitchMapping.resolve should probably mention that they must cache if their operation is expensive. Although there isn't a way to clear or update that cache, which might be a problem at some point... You don't really need the Scan example, you could use the GenericMRLoadGenerator with a -keepmap of 0. In the longer term I think a configured mapping class would be useful. A class named <property> <name>hadoop.configured.node.mapping</name> <value>host1=/rack1,host2=/rack1,host3=/rack4</value> </property>
I assume you meant findNewTask giving performance problems. To clarify (for the benefit of others), the JobTracker would consume more memory due to two reasons: The running time performance is helped by the caches. If the TIP is present in some cache the complexity of is O(num-level), since it takes O(1) at each level to find a TIP for a TaskTracker, no? The linear search for TIP (if it was a cache miss), is there even currently(existing codebase). The only additional thing here are the lookups when the level is more than 1. I did run the sort with dfs-cluster == map/reduce-cluster and the numbers were very comparable. Nothing concerning there..
I was thinking that the most typical cases would require just two levels - host, rack, and that's why i made this a compile time constant. But if it makes sense to make that runtime configurable, I can enable that behavior..
How about providing get/set APIs to the existing datastructure. The datastructure works for all cases with arbitrary number of levels (host, rack, switch, datacenter,..) (since it is a list of mappings from Node to TIPs). I didn't want to introduce Strings in the mapping since the NetworkTopology provides a Node abstraction for everything. If we went to Strings then we have an additional step of getting the Node from the String name (and vice versa), parsing strings to get to the Node, etc., which can be easily avoided by having the mappings based on Node.
Yes, I changed the definition of findNewTask. In the patch look for Find a new task to run. The diff doesn't have the line findNewTask. It just has the comment above it.
I have that in the 1985.v6.patch. Look for cluster_setup.xml and hdfs_design.xml, where I talk about how rack config can be setup. Did you mean something else?
Agreed regarding the documentation on the cache part. The update of the cache could be handled by the implementation of DNSToSwitchMapping, no? I can imagine a case, where the implementation starts a thread that periodically contacts some service and updates its cache. This is transparent to clients calling DNSToSwitchMapping.resolve.
Okay.
In the patch this is handled by a specific implementation of the DNSToSwitchMapping called StaticMapping, and that provides an API to set up the mapping from host to rackid (used in testcases). But I think I should be able to set things in the configuration and StaticMapping could initialize itself with the mapping provided there. I'll look at that. Here is a patch with some changes in the task cache datastructure. Also, there are changes that tries to ensure that rack (and higher level) locality is preserved for failed/speculative tasks also... It doesn't delete TIPs from a node cache until the TIP is complete or the node happens to be the host itself. The logic is that we should not delete TIPs from the rack level cache to avoid having the speculative/failed TIPs pay a peformance penalty if some other tasktracker from the same rack gets to run that failed/speculative task. We delete TIPs from the host cache since we don't execute the same tip on a host that failed to execute it earlier...
Would highly appreciate a quick review on this one. Patch attached with review comments incorporated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12373500/1985.v9.patch against trunk revision r613359. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1648/testReport/ This message is automatically generated. core-tests seems to have failed. But I am not able to get to what failed using the link http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1648/testReport/
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12373500/1985.v9.patch against trunk revision r613499. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1665/testReport/ This message is automatically generated. Thanks Nigel for pointing out that doc build failure might be incorrectly reported as a core-tests failure (we should address this issue). There was a problem in the forrest doc in the patch. This patch fixes that.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12373749/1985.v10.patch against trunk revision r614301. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests -1. The patch failed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1677/testReport/ This message is automatically generated. I think the core tests failed due to
One more of those occasions when the patch went out-of-sync.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12374045/1985.v11.patch against trunk revision 614721. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1679/testReport/ This message is automatically generated. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12374045/1985.v11.patch against trunk revision 615686. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit -1. The applied patch generated 289 release audit warnings (more than the trunk's current 288 warnings). findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1690/testReport/ This message is automatically generated. I just added our the release audit to the patch process. It looks for an increase in the number of files that don't have property license headers. This patch is missing one for src/java/org/apache/hadoop/net/ScriptBasedMapping.java which is why it got a -1. Don't worry about fixing this for now. I'll be fixing a number of these before we release 0.16.
I just added our the release audit to the patch process. It looks
for an increase in the number of files that don't have property license headers. This patch is missing one for src/java/org/apache/ hadoop/net/ScriptBasedMapping.java which is why it got a -1. Don't worry about fixing this for now. I'll be fixing a number of these before we release 0.16. The javadoc for hadoop.net.Node.setNodeLocation has a cut and paste error.
It would be good to have a single factory method for creating the DNSToSwitchMapping object from a configuration. The check for running under junit is problematic, because it means that we get different behavior in unit tests than in production. The namenode thread that does the rack resolution in the background means that data nodes will change locations a bit after they join the network. That seems problematic. I'm still going through the patch, but I wanted to get some of my comments down into jira. I think it's too late in the game to push this into 0.16. It's a 100k+ patch which introduces pretty pervasive changes. It it destabilizes the trunk it'll take another week or two to stabilize and branch 0.16. I propose we push this to 0.17
A few comments related to HDFS changes:
0. This implementation could introduce more delay in the Namenode startup time. Reducing the namenode startup time has been a key concern in recent times. Maybe the ResolutionThread can invoke the resolution script with a bunch of datanodes (rather than one datanode at a time).
True. But the reason for that is to handle the DNS resolution issue (since i needed the behavior for writing testcases that simulates racks/hosts on a single machine, without changing too much of framework code just for getting the testcases to work). Is there a better approach for the dns thing in Java (the main requirement is that all host resolutions like foo.com resolves to 127.0.0.1)? Let me check. Just like on datanode, you could have different behaviour on Namenode only for RackAwareness test. Currently this patch has different behaviour in registerDatanode() for all the tests. You could use the shostame variable "slave.host.name".
Not sure if the following would work, but ...
Have you considered replacing the SocketImplFactory of the java.net.Socket with a implementation that returns a PlainSocketImpl subclass that ignores the given InetAddress and uses always a 127.0.0.1 one? The caveat here is that PlainSocketImpl is a package private JVM class so some hack would have to be done there. If something along this line works then Hadoop code is not different during test and production, you are tapping at the JVM level.
I don't think this is going to be that big a deal if the implementation of the resolution is efficient. At least for now, I'd like to keep the resolve method take one datanode at a time.
This has a potential side effect .. for e.g., in the beginning of the namenode run some dfs blocks will appear to be not present on all the racks & so will it try to re-replicate, etc. ? But that can be special cased as we discussed offline. A few dfs-related comments:
1. A datanode should not be added the cluster map before its network location is resolved; 2. A datanode's block report should not be processed before its network location is resolved. Violating (1) cause the datanode be a candiate target of block allocation; Violating (2) may cause block rereplication or excessive block deletion to wrongly think replicas on the same rack to be on different racks. Both may result all replicas of a block to end up on the same rack. I also think that doing network resolution per datanode is inefficient. It would be nice if NN could have a list or a range of nodes that are alllowed to join the cluster and then resolve their network locations by running a script that takes a list or a range of nodes and returns all racks and the nodes belonged to each rack. Datanode network locations resolutions are performed before accepting any datanode registration. This would eliminate all the problems that are caused by lazy network resolution. I propose the following to avoid corner case HDFS issues and keeping the existing invariant (that all operations like block rereplication/deletion happen knowing the rackid of the datanodes).
1) On namenode startup, the namenode keeps on accepting datanode registrations and lazily resolving them. (as in the patch) 2) Block report is not processed if the datanode in question has not been resolved 3) When the resolution thread gets a chance to run, it resolves all what exists in the queue Makes sense ? Sounds good to me. Regarding step 2: will the block report processing wait till the datanode is resolved, or will it return an RPC error to the datanode?
I am thinking of returning an RPC error (like DNA_UNRESOLVED). That will ensure that RPC handlers are not held up unnecessarily on the namenode while the resolution is in progress (which depends on when the resolution thread gets a chance to run).
Ok, I agree. If this specific error could be more generic (e.g. RETRY with a retry time specified by the server) that will be great.
>1) On namenode startup, the namenode keeps on accepting datanode registrations and lazily resolving them. (as in the patch) Throwing a block report away because the NN has not figured out the DN's rack location is an expensive proposition. Looks like we need to separate (a) and (b). Perhaps: [DNA_REGISTER_AND_BR] – i don't see how a combined command would be used. I like sanjay's proposal. If the namenode receives a block report but the datanode's hostname has not been resolved yet, then the NN sets a bit in the DatanodeDescriptor that indicates that the namenode should request a block report. The NN then dumps the contents of this block report on the floor but returns success to the block-report RPC.
Finally, when the name of the datanode is resolved and if this new bit is set, then the NN sends a request-for-block-report request as part of the response to a heartbeat from that datanode. Attaching one version of the revised patch. The changes from the previous patch are:
1) The conditional code depending on whether the framework is running under junit has been removed. 2) NetUtils has some APIs to store & retrieve static resolutions (foo.bar.com -> localhost). This comes handy for tests to do with racks. 3) On the DFS side of things, a new command DNA_BLOCKREPORT has been added. Now a datanode will waste at most one block report. It can potentially succeed in its first block report (if the namenode finished resolving the rackID for the datanode before that). 4) Batch resolution API has been added to DNSToSwitchMapping interface. This patch passes 'ant test'. I am in the process of doing testing on a big cluster. Pushing the latest patch through hudson
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12375946/1985.v20.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 20 new or modified tests. javadoc -1. The javadoc tool appears to have generated 1 warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs -1. The patch appears to introduce 3 new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1816/testReport/ This message is automatically generated. Two minor comments:
1. DatanodeProtocol.versionId should be 12 (instead of 13) Now, I have another (and much wider) question about deployment. This seems to be an INCOMPATIBLE change to me. If we upgrade an existing cluster with this patch, then the script that determines rack locations have to be moved from the datanode to the namenode. I understand that keeping backward compatibility means that we have to keep lots of old code around for a while. Do you have any ideas in this regard? One option would be to keep the "-r" command line option in the Datanode and print a warning message saying "Deprecated option: move this functionality to the namenode". Yes, this is really an INCOMPATIBLE change. If this is indeed a concern, then what you suggest to do with keeping "-r" option makes sense. My worry is that we will end up having two ways of resolving things in the framework since the MapReduce side of things will use the new approach. The other issue is this that the separation of deprecated and new code won't be that distinct (as in the case of deprecation of an API where we invoke the new method implementation inside the deprecated method) since the patch touches a bunch of files at the level of implementation of methods. But if deprecation is a real concern, I will address it.
I think in this case, supporting the backwards computability is prohibitive. We should issue an error if they try to use the old setup and put it in the incompatible changes section. But I think that it is ok.
This patch addresses the latest hudson issues (except one findbugs issue which can be ignored). It also addresses the dfs comment by Dhruba on moving the DatanodeDescriptor code to FSNameSystem.java and the DatanodeProtocol version value, and throws an error if the old rack setup is used.
1. you should document that the output of resolve is "/rack" instead of "/rack/host"
2. remove the resolve(String) and leave the resolve(List<String>) 3. Should put the RawSplit into the TaskInProgress rather than the BytesWritable class name, and locations 4. The cache should be an identity hash map 5. Building the cache builds a list of the tips on each rack and uses List.contains before adding each tip. Since each rack may have 30k or more tips on it, this will be slow. Note on the last one, since the number of locations per a tip will likely be small, you can put the racks for a given tip into a set and iterate through the set.
Actually it is easier than that .. only one check is required without adding extra space per a TIP. If the last element of the cache in question is the TIP that we are trying to insert, we don't insert... Here is how it looks like - consider the earlier inner 'for' loop inside createCache. The modified code there:
+ if (hostMaps == null) {
+ hostMaps = new ArrayList<TaskInProgress>();
+ cache.put(node, hostMaps);
+ hostMaps.add(maps[i]);
+ }
+ //check whether the hostMaps already contains an entry for a TIP
+ //This will be true for nodes that are racks and multiple nodes in
+ //the rack contain the input for a tip. Note that if it already
+ //exists in the hostMaps, it must be the last element there since
+ //we process one TIP at a time sequentially in the split-size order
+ if (hostMaps.get(hostMaps.size() - 1) != maps[i]) {
+ hostMaps.add(maps[i]);
+ }
This addresses Owen's comments
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12376595/1985.v24.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 20 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs -1. The patch appears to introduce 1 new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1842/testReport/ This message is automatically generated. The one and only findbugs warning can be ignored.
The last patch went out-of-sync with the trunk. The update to the last patch is very trivial though, and I validated the hudson tests on my local machine.
Integrated in Hadoop-trunk #415 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/415/
Noted as incompatible in changes.txt
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
yes, DNS name (hostname) to switch id mapping should be managed just like the hostname to IP mapping.
The info should be available to the DFS namenode, datanodes and applications in the same way.
Job tracker uses this info for task assignment. In general, DFS client should also use this info to decide where to fetch a needed block.