Hadoop Common
HADOOP-1985

Abstract node to switch mapping into a topology service class used by namenode and jobtracker

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      This issue introduces rack awareness for map tasks. It also moves the rack resolution logic to the central servers - NameNode & JobTracker. The administrator can specify a loadable class, given by topology.node.switch.mapping.impl, that implements the logic for rack resolution. The class must implement a method - resolve(List<String> names), where names is the list of DNS names/IP addresses to be resolved. The return value is a list of resolved network paths of the form /foo/rack, where rack is the rackID to which the node belongs and foo is the switch through which multiple racks are connected, and so on. The default implementation of this class, packaged with Hadoop, is org.apache.hadoop.net.ScriptBasedMapping, which loads a script to use for rack resolution. The script location is configurable: it is specified by topology.script.file.name and defaults to an empty script. When the script name is empty, /default-rack is returned for all DNS names/IP addresses. The loadable topology.node.switch.mapping.impl gives administrators the flexibility to define how their site's node resolution should happen.
      For mapred, one can also specify the level of the cache with respect to the number of levels in the resolved network path (it defaults to two). This means that the JobTracker will cache tasks at the host level and at the rack level.
      Known issue: the task caching will not work with levels greater than 2 (beyond racks). This bug is tracked in HADOOP-3296.
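The resolver contract described in the release note can be sketched in a minimal, self-contained form. Note that DNSToSwitchMapping here is a simplified stand-in declared locally for illustration, not the real org.apache.hadoop.net interface; only the resolve(List<String>) shape and the /default-rack fallback behaviour come from the text above:

```java
import java.util.ArrayList;
import java.util.List;

public class TopologyExample {
    // Simplified stand-in for the interface named in the release note
    interface DNSToSwitchMapping {
        // Returns one network path (e.g. "/switch/rack") per input name
        List<String> resolve(List<String> names);
    }

    // Mimics the default behaviour when topology.script.file.name is
    // empty: every host falls into /default-rack
    static class DefaultRackMapping implements DNSToSwitchMapping {
        public List<String> resolve(List<String> names) {
            List<String> paths = new ArrayList<>();
            for (int i = 0; i < names.size(); i++) {
                paths.add("/default-rack");
            }
            return paths;
        }
    }

    public static void main(String[] args) {
        DNSToSwitchMapping mapping = new DefaultRackMapping();
        List<String> names = new ArrayList<>();
        names.add("node1.example.com");
        names.add("10.0.0.2");
        System.out.println(mapping.resolve(names));
    }
}
```

The real ScriptBasedMapping would shell out to the script named by topology.script.file.name rather than return a constant.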

      Description

      In order to implement switch locality in MapReduce, we need to have switch location in both the namenode and job tracker. Currently the namenode asks the data nodes for this info, and they run a local script to answer this question. In our environment, and others that I know of, there is no reason to push this to each node. It is easier to maintain a centralized script that maps node DNS names to switch strings.

      I propose that we build a new class that caches known DNS name to switch mappings and invokes a loadable class or a configurable system call to resolve unknown DNS to switch mappings. We can then add this to the namenode to support the current block to switch mapping needs and simplify the data nodes. We can also add this same callout to the job tracker and then implement rack locality logic there without needing to change the filesystem API or the split planning API.

      Not only is this the least intrusive path to building rack-local MapReduce that I can identify, it is also compatible with future infrastructures that may derive topology on the fly.
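The proposed cache-plus-callout class might look roughly like this; the class and constructor names are hypothetical, and a plain Function stands in for the loadable class or configurable system call:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class CachedNodeResolver {
    private final Map<String, String> cache = new HashMap<>();
    // Callout for unknown names: a loadable class or a script invocation
    private final Function<String, String> rawResolver;

    public CachedNodeResolver(Function<String, String> rawResolver) {
        this.rawResolver = rawResolver;
    }

    // Resolve a batch of DNS names, consulting the cache first and
    // invoking the callout only for names not seen before
    public List<String> resolve(List<String> names) {
        List<String> paths = new ArrayList<>();
        for (String name : names) {
            paths.add(cache.computeIfAbsent(name, rawResolver));
        }
        return paths;
    }
}
```

The same instance could then serve both the namenode's block-to-switch lookups and the job tracker's rack-locality logic, as the description suggests.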

      1. 1985.v25.patch
        101 kB
        Devaraj Das
      2. 1985.v24.patch
        101 kB
        Devaraj Das
      3. 1985.v23.patch
        100 kB
        Devaraj Das
      4. 1985.v20.patch
        100 kB
        Devaraj Das
      5. 1985.v19.patch
        99 kB
        Devaraj Das
      6. 1985.v11.patch
        88 kB
        Devaraj Das
      7. 1985.v10.patch
        81 kB
        Devaraj Das
      8. 1985.v9.patch
        82 kB
        Devaraj Das
      9. jobinprogress.patch
        11 kB
        Devaraj Das
      10. 1985.v6.patch
        83 kB
        Devaraj Das
      11. 1985.v5.patch
        78 kB
        Devaraj Das
      12. 1985.v4.patch
        78 kB
        Devaraj Das
      13. 1985.v3.patch
        81 kB
        Devaraj Das
      14. 1985.v2.patch
        80 kB
        Devaraj Das
      15. 1985.v1.patch
        78 kB
        Devaraj Das
      16. 1985.new.patch
        27 kB
        Devaraj Das

        Issue Links

          Activity

          Gavin made changes -
          Link This issue is depended upon by MAPREDUCE-267 [ MAPREDUCE-267 ]
          Gavin made changes -
          Link This issue blocks MAPREDUCE-267 [ MAPREDUCE-267 ]
          Gavin made changes -
          Link This issue is depended upon by HADOOP-2119 [ HADOOP-2119 ]
          Gavin made changes -
          Link This issue blocks HADOOP-2119 [ HADOOP-2119 ]
          steve_l made changes -
          Link This issue is related to HDFS-891 [ HDFS-891 ]
          Owen O'Malley made changes -
          Component/s mapred [ 12310690 ]
          Owen O'Malley made changes -
          Component/s dfs [ 12310710 ]
          Nigel Daley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Devaraj Das made changes -
          Release Note edited (text duplicated from the Release Note above; this change added the known-issue note about HADOOP-3296)
          Hadoop Flags [Reviewed, Incompatible change] [Incompatible change, Reviewed]
          Devaraj Das made changes -
          Hadoop Flags [Incompatible change] [Incompatible change, Reviewed]
          Release Note edited (text duplicated from the Release Note above, without the known-issue note)
          Robert Chansler made changes -
          Hadoop Flags [Incompatible change]
          Robert Chansler added a comment -

          Noted as incompatible in changes.txt

          Hudson added a comment -

          Integrated in Hadoop-trunk #415 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/415/ )
          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Devaraj Das added a comment -

          I just committed this.

          Devaraj Das made changes -
          Attachment 1985.v25.patch [ 12376740 ]
          Devaraj Das added a comment -

          The last patch went out of sync with trunk. The update over the last patch is trivial, though, and I validated the Hudson tests on my local machine.

          Devaraj Das added a comment -

          The one and only findbugs warning can be ignored.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12376595/1985.v24.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 20 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to introduce 1 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1842/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1842/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1842/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1842/console

          This message is automatically generated.

          Devaraj Das made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Devaraj Das made changes -
          Attachment 1985.v24.patch [ 12376595 ]
          Devaraj Das added a comment -

          This addresses Owen's comments.

          Devaraj Das added a comment -

          Actually it is easier than that: only one check is required, without adding extra space per TIP. If the last element of the cache list in question is the TIP we are trying to insert, we don't insert it again. Here is how it looks; consider the earlier inner 'for' loop inside createCache. The modified code there:

          +          if (hostMaps == null) {
          +            hostMaps = new ArrayList<TaskInProgress>();
          +            cache.put(node, hostMaps);
          +            hostMaps.add(maps[i]);
          +          }
          +          //check whether the hostMaps already contains an entry for a TIP
          +          //This will be true for nodes that are racks and multiple nodes in
          +          //the rack contain the input for a tip. Note that if it already
          +          //exists in the hostMaps, it must be the last element there since
          +          //we process one TIP at a time sequentially in the split-size order
          +          if (hostMaps.get(hostMaps.size() - 1) != maps[i]) {
          +            hostMaps.add(maps[i]);
          +          }
          
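Rendered as a standalone method, the dedup idea above can be exercised in isolation (CacheDedup and addToCache are hypothetical names; the generic T stands in for TaskInProgress, and the last-element reference check is the one from the snippet):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CacheDedup {
    // Add tip to the node's list, skipping the duplicate that arises when
    // several hosts in one rack carry input for the same TIP. Because TIPs
    // are processed one at a time in split-size order, a duplicate can
    // only ever be the last element of the list.
    static <T> void addToCache(Map<String, List<T>> cache, String node, T tip) {
        List<T> hostMaps = cache.get(node);
        if (hostMaps == null) {
            hostMaps = new ArrayList<>();
            cache.put(node, hostMaps);
            hostMaps.add(tip);
            return;
        }
        // Reference comparison, as in the patch: same TIP object, not equals()
        if (hostMaps.get(hostMaps.size() - 1) != tip) {
            hostMaps.add(tip);
        }
    }
}
```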
          Owen O'Malley added a comment -

          Note on the last one, since the number of locations per a tip will likely be small, you can put the racks for a given tip into a set and iterate through the set.

          Owen O'Malley added a comment -

          1. you should document that the output of resolve is "/rack" instead of "/rack/host"
          2. remove the resolve(String) and leave the resolve(List<String>)
          3. Should put the RawSplit into the TaskInProgress rather than the BytesWritable
          class name, and locations
          4. The cache should be an identity hash map
          5. Building the cache builds a list of the tips on each rack and uses
          List.contains before adding each tip. Since each rack may have 30k or more
          tips on it, this will be slow.

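Owen's items 4 and 5 can be illustrated together; this is a hypothetical sketch (not the patch code) of an identity-keyed cache and set-based dedup of racks per TIP:

```java
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ReviewSketch {
    // Item 4: key lookups by object identity, not equals()
    static final Map<Object, List<Object>> cache = new IdentityHashMap<>();

    // Item 5: collapse a TIP's host locations to distinct racks with a
    // set, instead of an O(n) List.contains scan over a rack list that
    // may hold 30k or more TIPs
    static Set<String> distinctRacks(List<String> hostPaths) {
        Set<String> racks = new HashSet<>();
        for (String path : hostPaths) {
            // "/rack/host" -> "/rack"
            racks.add(path.substring(0, path.lastIndexOf('/')));
        }
        return racks;
    }
}
```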
          Devaraj Das made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Devaraj Das made changes -
          Attachment 1985.v23.patch [ 12376403 ]
          Devaraj Das added a comment - edited

          This patch addresses the latest hudson issues (except one findbugs issue which can be ignored). It also addresses the dfs comment by Dhruba on moving the DatanodeDescriptor code to FSNameSystem.java and the DatanodeProtocol version value, and throws an error if the old rack setup is used.

          Owen O'Malley added a comment -

          I think in this case, supporting backward compatibility is prohibitive. We should issue an error if they try to use the old setup and put it in the incompatible changes section. But I think that it is ok.

          Devaraj Das added a comment -

          Yes, this is really an INCOMPATIBLE change. If this is indeed a concern, then what you suggest about keeping the "-r" option makes sense. My worry is that we will end up having two ways of resolving things in the framework, since the MapReduce side of things will use the new approach. The other issue is that the separation of deprecated and new code won't be that distinct (as in the case of deprecation of an API, where we invoke the new method implementation inside the deprecated method), since the patch touches a bunch of files at the level of method implementations. But if deprecation is a real concern, I will address it.

          dhruba borthakur added a comment -

          Two minor comments:

          1. DatanodeProtocol.versionId should be 12 (instead of 13)
          2. The DatanodeDescriptor objects should not be accessed from NameNode.java. Instead they should be accessed only in FSNamesystem.java because they are protected by the lock on the FSNamesystem object.

          Now, I have another (and much wider) question about deployment. This seems to be an INCOMPATIBLE change to me. If we upgrade an existing cluster with this patch, then the script that determines rack locations has to be moved from the datanodes to the namenode. I understand that keeping backward compatibility means that we have to keep lots of old code around for a while. Do you have any ideas in this regard? One option would be to keep the "-r" command line option in the Datanode and print a warning message saying "Deprecated option: move this functionality to the namenode".

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12375946/1985.v20.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 20 new or modified tests.

          javadoc -1. The javadoc tool appears to have generated 1 warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to introduce 3 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1816/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1816/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1816/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1816/console

          This message is automatically generated.

          Devaraj Das made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Devaraj Das added a comment -

          Pushing the latest patch through Hudson.

          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Devaraj Das made changes -
          Attachment 1985.v20.patch [ 12375946 ]
          Devaraj Das added a comment -

          This is a tested patch.

          Devaraj Das made changes -
          Attachment 1985.v19.patch [ 12375651 ]
          Devaraj Das added a comment -

          Attaching one version of the revised patch. The changes from the previous patch are:
          1) The conditional code depending on whether the framework is running under junit has been removed.
          2) NetUtils has some APIs to store & retrieve static resolutions (foo.bar.com -> localhost). This comes in handy for tests to do with racks.
          3) On the DFS side of things, a new command DNA_BLOCKREPORT has been added. Now a datanode will waste at most one block report. It can potentially succeed in its first block report (if the namenode finished resolving the rackID for the datanode before that).
          4) Batch resolution API has been added to DNSToSwitchMapping interface.

          This patch passes 'ant test'. I am in the process of doing testing on a big cluster.

          Amar Kamat made changes -
          Link This issue blocks HADOOP-2812 [ HADOOP-2812 ]
          dhruba borthakur added a comment -

          I like sanjay's proposal. If the namenode receives a block report but the datanode's hostname has not been resolved yet, then the NN sets a bit in the DatanodeDescriptor that indicates that the namenode should request a block report. The NN then dumps the contents of this block report on the floor but returns success to the block-report RPC.

          Finally, when the name of the datanode is resolved and if this new bit is set, then the NN sends a request-for-block-report request as part of the response to a heartbeat from that datanode.

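The scheme dhruba describes boils down to a small piece of per-datanode state; here is a hedged sketch (BlockReportGate and its method names are illustrative, not the actual DatanodeDescriptor API):

```java
public class BlockReportGate {
    private boolean rackResolved = false;
    private boolean blockReportPending = false;

    // Called when a block report arrives. Returns true if it can be
    // processed; if the node's rack is still unresolved, the contents are
    // dropped on the floor (the RPC still returns success) and a flag is
    // set so the NN can request a fresh report later.
    public synchronized boolean offerBlockReport() {
        if (!rackResolved) {
            blockReportPending = true;
            return false; // contents discarded
        }
        return true;
    }

    public synchronized void markRackResolved() {
        rackResolved = true;
    }

    // Piggybacked on the heartbeat response: request a block report only
    // once the rack location is known and a report was previously dropped.
    public synchronized boolean shouldRequestBlockReport() {
        if (rackResolved && blockReportPending) {
            blockReportPending = false;
            return true;
        }
        return false;
    }
}
```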
          Sanjay Radia added a comment -

          >1) On namenode startup, the namenode keeps on accepting datanode registrations and lazily resolving them. (as in the patch)
          >2) Block report is not processed if the datanode in question has not been resolved

          Throwing a block report away because the NN has not figured out the DN's rack location is an expensive proposition.
          The reason being that block reports generate a lot of memory allocations and GC.
          Repeated BRs cause the startup time of the NN to increase.
          Hence when the registration comes in and if the rack location of the DN has not been determined, the NN should NOT request a
          block report. The block report should be requested at a future heartbeat when the rack location has been determined.
          The problem is that a reply of "DNA Register" is interpreted as
          a) please send registration
          b) please send BR (my recent patch added a random delay between steps a and b)

          Looks like we need to separate (a) and (b).
          This would require a new NN-DN command.

          Perhaps:
          DNA_REGISTER
          DNA_BR

          [DNA_REGISTER_AND_BR] – I don't see how a combined command would be used.
          Q. Does rack location determination occur AFTER the registration?

          Amar Kamat made changes -
          Link This issue blocks HADOOP-2729 [ HADOOP-2729 ]
          dhruba borthakur added a comment -

          Ok, I agree. If this specific error could be more generic (e.g. RETRY with a retry time specified by the server) that will be great.

          Devaraj Das added a comment (edited) -

          I am thinking of returning an RPC error (like DNA_UNRESOLVED). That will ensure that RPC handlers are not held up unnecessarily on the namenode while the resolution is in progress (which depends on when the resolution thread gets a chance to run).

          dhruba borthakur added a comment -

          Sounds good to me. Regarding step 2: will the block report processing wait till the datanode is resolved, or will it return an RPC error to the datanode?

          Devaraj Das added a comment -

          I propose the following to avoid corner-case HDFS issues while keeping the existing invariant (that all operations like block re-replication/deletion happen knowing the rackid of the datanodes):
          1) On namenode startup, the namenode keeps on accepting datanode registrations and lazily resolving them (as in the patch).
          2) A block report is not processed if the datanode in question has not been resolved.
          3) When the resolution thread gets a chance to run, it resolves everything in the queue.

          Makes sense?

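          The three steps above can be sketched in miniature. This is an editor's illustration, not code from the patch: the class and method names are invented, and the real resolution thread runs asynchronously, whereas this sketch drains the queue synchronously so the behaviour is easy to follow.

          ```java
          import java.util.*;
          import java.util.function.Function;

          // Editor's sketch of the lazy-resolution proposal (invented names).
          class LazyRackResolution {
              static final String DEFAULT_RACK = "/default-rack";
              private final Map<String, String> nodeLocation = new HashMap<>();
              private final Deque<String> unresolved = new ArrayDeque<>();

              // Step 1: accept the registration immediately; park the node at a
              // default location and queue it for resolution.
              void registerDatanode(String host) {
                  nodeLocation.put(host, DEFAULT_RACK);
                  unresolved.add(host);
              }

              // Step 2: a block report is refused while the node is still queued.
              boolean canProcessBlockReport(String host) {
                  return nodeLocation.containsKey(host) && !unresolved.contains(host);
              }

              // Step 3: when the resolver gets a chance to run, it fixes up
              // everything in the queue.
              void drainQueue(Function<String, String> resolver) {
                  while (!unresolved.isEmpty()) {
                      String host = unresolved.poll();
                      nodeLocation.put(host, resolver.apply(host));
                  }
              }

              String locationOf(String host) {
                  return nodeLocation.get(host);
              }
          }
          ```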
          Hairong Kuang added a comment -

          A few dfs-related comments:
          1. A datanode should not be added to the cluster map before its network location is resolved;
          2. A datanode's block report should not be processed before its network location is resolved.
          Violating (1) causes the datanode to become a candidate target for block allocation; violating (2) may cause block re-replication or excessive block deletion to wrongly treat replicas on the same rack as being on different racks. Either may result in all replicas of a block ending up on the same rack.

          I also think that doing network resolution per datanode is inefficient. It would be nice if the NN could have a list or a range of nodes that are allowed to join the cluster, and then resolve their network locations by running a script that takes that list or range and returns all racks and the nodes belonging to each rack. Datanode network location resolution would be performed before accepting any datanode registration. This would eliminate all the problems caused by lazy network resolution.

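          The batch idea above fits naturally with a resolve() call that takes the whole list of hosts at once, matching the resolve(List&lt;String&gt;) shape of the mapping interface, instead of invoking a script once per datanode. A hedged sketch (names invented; the in-memory rack table stands in for whatever a site's script or node database would actually return):

          ```java
          import java.util.*;

          // Editor's sketch of batch rack resolution (invented names).
          class BatchRackResolver {
              private final Map<String, String> rackTable;

              BatchRackResolver(Map<String, String> rackTable) {
                  this.rackTable = rackTable;
              }

              // One pass over the whole batch; unknown hosts fall back to
              // /default-rack, like the default mapping with an empty script.
              List<String> resolve(List<String> names) {
                  List<String> paths = new ArrayList<>(names.size());
                  for (String name : names) {
                      paths.add(rackTable.getOrDefault(name, "/default-rack"));
                  }
                  return paths;
              }
          }
          ```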
          Devaraj Das added a comment -

          This implementation could introduce more delay in the Namenode startup time. Maybe the ResolutionThread can invoke the resolution script with a bunch of datanodes (rather than one datanode at a time).

          I don't think this is going to be that big a deal if the implementation of the resolution is efficient. At least for now, I'd like to keep the resolve method taking one datanode at a time.

          When the ResolutionThread operates on a datanode, it updates the networkLocation field of the datanode. This means that if a datanode is used before the resolution thread gets to it, it might return a default location, but this is ok and acceptable. If we adopt this approach, then the Resolution could be done much lazily, in fact, even towards the end of the SafeMode period, thereby reducing namenode restart times.

          This has a potential side effect: e.g., at the beginning of the namenode run, some dfs blocks will appear to be missing from some racks, so will it try to re-replicate, etc.? But that can be special-cased as we discussed offline.

          Alejandro Abdelnur added a comment -

          Not sure if the following would work, but ...

          Have you considered replacing the SocketImplFactory of java.net.Socket with an implementation that returns a PlainSocketImpl subclass that ignores the given InetAddress and always uses 127.0.0.1?

          The caveat here is that PlainSocketImpl is a package-private JVM class, so some hack would have to be done there.

          If something along these lines works, then the Hadoop code is no different between test and production; you are tapping in at the JVM level.

          Raghu Angadi added a comment -

          Just like on the datanode, you could have different behaviour on the Namenode only for the RackAwareness test. Currently this patch has different behaviour in registerDatanode() for all the tests. You could use the hostname variable "slave.host.name".

          Devaraj Das added a comment -

          The check for running under junit is problematic, because it means that we get different behavior in unit tests than in production

          True. But the reason for that is to handle the DNS resolution issue (since I needed the behavior for writing testcases that simulate racks/hosts on a single machine, without changing too much framework code just to get the testcases to work). Is there a better approach for the DNS thing in Java (the main requirement is that all host resolutions like foo.com resolve to 127.0.0.1)? Let me check.

          dhruba borthakur added a comment -

          A few comments related to HDFS changes:

          0. This implementation could introduce more delay in the Namenode startup time. Reducing the namenode startup time has been a key concern in recent times. Maybe the ResolutionThread can invoke the resolution script with a bunch of datanodes (rather than one datanode at a time).
          1. DatanodeProtocol.java: typo "registraction" instead of registration
          2. The ResolutionThread should be renamed "ResolutionMonitor", just to keep conformity with the other static threads in FSNamesystem. Also, the thread should be created in FSNamesystem.initialize(), to keep conformity with the remainder of the code. The other threads (e.g. ReplicationMonitor) are created as Daemons, so this new one should be too.
          3. Typo in FSNamesystem: "NSToSwitchMapping reolution Thread"
          4. If my memory serves me right, the current implementation depends on the fact that all the datanodes that are in FSNamesystem.datanodeMap should exist in FSNamesystem.host2DataNodeMap as well as in FSNamesystem.clusterMap. If we want to keep this invariant, then FSNamesystem.registerDatanode should insert the datanode in clusterMap (with a default rack setting). It will also insert the datanode in queue for the ResolutionThread. When the ResolutionThread operates on a datanode, it updates the networkLocation field of the datanode. This means that if a datanode is used before the resolution thread gets to it, it might return a default location, but this is ok and acceptable. If we adopt this approach, then the Resolution could be done much lazily, in fact, even towards the end of the SafeMode period, thereby reducing namenode restart times.

          Nigel Daley made changes -
          Fix Version/s 0.17.0 [ 12312913 ]
          Fix Version/s 0.16.0 [ 12312740 ]
          Nigel Daley added a comment -

          +1 to moving this to 0.17.

          Sameer Paranjpye added a comment -

          I think it's too late in the game to push this into 0.16. It's a 100k+ patch which introduces pretty pervasive changes. If it destabilizes the trunk, it'll take another week or two to stabilize and branch 0.16. I propose we push this to 0.17.

          Owen O'Malley added a comment -

          The javadoc for hadoop.net.Node.setNodeLocation has a cut and paste error.

          It would be good to have a single factory method for creating the DNSToSwitchMapping object from a configuration.

          The check for running under junit is problematic, because it means that we get different behavior in unit tests than in production.

          The namenode thread that does the rack resolution in the background means that data nodes will change locations a bit after they join the network. That seems problematic.

          I'm still going through the patch, but I wanted to get some of my comments down into jira.

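          A minimal sketch of such a single factory method, illustrative only: DnsToSwitchMapping below is a stand-in interface, not Hadoop's actual org.apache.hadoop.net.DNSToSwitchMapping, and Properties stands in for Hadoop's Configuration class; only the factory shape is the point.

          ```java
          import java.util.*;

          // Stand-in for the real mapping interface (editor's sketch).
          interface DnsToSwitchMapping {
              List<String> resolve(List<String> names);
          }

          // Mirrors the default behaviour: every host maps to /default-rack.
          class DefaultMapping implements DnsToSwitchMapping {
              public List<String> resolve(List<String> names) {
                  List<String> racks = new ArrayList<>();
                  for (int i = 0; i < names.size(); i++) {
                      racks.add("/default-rack");
                  }
                  return racks;
              }
          }

          class MappingFactory {
              // The single factory method: read the impl class name from
              // configuration and instantiate it reflectively, falling back
              // to the default mapping when nothing is configured.
              static DnsToSwitchMapping get(Properties conf) throws Exception {
                  String impl = conf.getProperty("topology.node.switch.mapping.impl",
                                                 DefaultMapping.class.getName());
                  return (DnsToSwitchMapping) Class.forName(impl)
                          .getDeclaredConstructor().newInstance();
              }
          }
          ```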
          Amar Kamat made changes -
          Link This issue blocks HADOOP-2119 [ HADOOP-2119 ]
          Nigel Daley added a comment -

          I just added the release audit to the patch process. It looks for an increase in the number of files that don't have proper license headers. This patch is missing one for src/java/org/apache/hadoop/net/ScriptBasedMapping.java, which is why it got a -1. Don't worry about fixing this for now. I'll be fixing a number of these before we release 0.16.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12374045/1985.v11.patch
          against trunk revision 615686.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit -1. The applied patch generated 289 release audit warnings (more than the trunk's current 288 warnings).

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1690/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1690/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1690/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1690/console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12374045/1985.v11.patch
          against trunk revision 614721.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1679/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1679/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1679/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1679/console

          This message is automatically generated.

          Devaraj Das made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Devaraj Das made changes -
          Attachment 1985.v11.patch [ 12374045 ]
          Devaraj Das added a comment -

          One more of those occasions when the patch went out-of-sync.

          Devaraj Das added a comment -

          I think the core tests failed due to HADOOP-2680 ("all datanodes are bad" ..). They pass on my machine.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12373749/1985.v10.patch
          against trunk revision r614301.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1677/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1677/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1677/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1677/console

          This message is automatically generated.

          Devaraj Das made changes -
          Assignee Doug Cutting [ cutting ] Devaraj Das [ devaraj ]
          Devaraj Das made changes -
          Assignee Devaraj Das [ devaraj ] Doug Cutting [ cutting ]
          Status Open [ 1 ] Patch Available [ 10002 ]
          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Devaraj Das made changes -
          Attachment 1985.v10.patch [ 12373749 ]
          Devaraj Das added a comment -

          Thanks Nigel for pointing out that a doc build failure might be incorrectly reported as a core-tests failure (we should address this issue). There was a problem in the forrest doc in the patch. This patch fixes that.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12373500/1985.v9.patch
          against trunk revision r613499.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1665/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1665/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1665/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1665/console

          This message is automatically generated.

          Devaraj Das made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Devaraj Das added a comment -

          core-tests seems to have failed, but I am not able to see what failed using the link http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1648/testReport/ . Also, the tests passed on my machine. Cancelling the patch to get it through Hudson again.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12373500/1985.v9.patch
          against trunk revision r613359.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1648/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1648/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1648/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1648/console

          This message is automatically generated.

          Devaraj Das made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Devaraj Das made changes -
          Attachment 1985.v9.patch [ 12373500 ]
          Devaraj Das added a comment -

          Patch attached with review comments incorporated.

          Devaraj Das made changes -
          Attachment jobinprogress.patch [ 12373306 ]
          Devaraj Das added a comment -

          Here is a patch with some changes in the task cache data structure. There are also changes that try to ensure that rack (and higher-level) locality is preserved for failed/speculative tasks as well. The patch doesn't delete TIPs from a node cache until the TIP is complete or the node is the host itself. The logic is that we should not delete TIPs from the rack-level cache, to avoid having speculative/failed TIPs pay a performance penalty if some other tasktracker from the same rack gets to run that failed/speculative task. We do delete TIPs from the host cache, since we don't execute the same TIP on a host that failed to execute it earlier.
          I would highly appreciate a quick review on this one.

          Devaraj Das added a comment - edited

          I'm worried about the time and memory performance of this. Have you run a sort with dfs cluster == map/reduce cluster and compared running times and job tracker memory size? We've already seen cases where the current pollForNewTask causes performance problems...

          I assume you meant findNewTask giving performance problems. To clarify (for the benefit of others), the JobTracker would consume more memory for two reasons:
          1) The NetworkTopology is created here. This cannot be avoided, right?
          2) Multiple cache levels are maintained. Currently (in the existing codebase), we maintain only one cache level (host to maps). This patch adds a cache at each level. The number of levels is currently set to two (host, rack) as a compile-time constant, to allow efficient lookups. But the caches are just mappings from Node to references to objects in the NetworkTopology. Are you referring to these additional caches when you say memory performance may be a problem?

          The running time is helped by the caches. If the TIP is present in some cache, the lookup complexity is O(num-levels), since it takes O(1) at each level to find a TIP for a TaskTracker, no? The linear search for a TIP (on a cache miss) exists even in the current codebase. The only additional cost here is the lookups when the level is more than 1.
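          The cache layout under discussion could be sketched as below. Node, TaskInProgress, and findCachedTask are simplified stand-ins for the Hadoop classes, not the patch's actual code; the point is the O(num-levels) lookup, one O(1) map probe per topology level while walking up the tracker's ancestry.

```java
import java.util.*;

// Simplified sketch (not the patch's actual code) of a per-level task
// cache: one Map per topology level, keyed by the node at that level.
public class MultiLevelCacheSketch {

    static class Node {
        final String path;   // e.g. "/rack1/host1"
        final Node parent;   // next level up in the NetworkTopology
        Node(String path, Node parent) { this.path = path; this.parent = parent; }
    }

    static class TaskInProgress {
        final String id;
        TaskInProgress(String id) { this.id = id; }
    }

    // cache.get(0) is the host-level map, cache.get(1) the rack-level map, ...
    // One O(1) map probe per level gives the O(num-levels) lookup.
    static TaskInProgress findCachedTask(List<Map<Node, List<TaskInProgress>>> cache,
                                         Node tracker) {
        Node key = tracker;
        for (int level = 0; level < cache.size() && key != null; level++, key = key.parent) {
            List<TaskInProgress> tips = cache.get(level).get(key);
            if (tips != null && !tips.isEmpty()) {
                return tips.get(0);
            }
        }
        return null;   // cache miss: the caller falls back to a linear scan
    }

    public static void main(String[] args) {
        Node rack = new Node("/rack1", null);
        Node host = new Node("/rack1/host1", rack);
        TaskInProgress tip = new TaskInProgress("tip_0001");

        Map<Node, List<TaskInProgress>> hostLevel = new HashMap<>();
        Map<Node, List<TaskInProgress>> rackLevel = new HashMap<>();
        rackLevel.put(rack, new ArrayList<>(List.of(tip)));   // rack-local only

        TaskInProgress found = findCachedTask(List.of(hostLevel, rackLevel), host);
        System.out.println(found == null ? "miss" : found.id);   // prints "tip_0001"
    }
}
```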

          I did run the sort with dfs-cluster == map/reduce-cluster and the numbers were very comparable. Nothing concerning there.

          It bothers me that the max levels is hard coded rather than configurable.

          I was thinking that the most typical cases would require just two levels (host, rack), and that's why I made this a compile-time constant. But if it makes sense to make that runtime configurable, I can enable that behavior.

          From a style point of view, I probably would have defined a new class rather than use nested java.utils containers like List<Map<Node, List<TaskInProgress>>>. That way if we change the representation later it won't be scattered through the code. In particular, I can imagine wanting to have the data structure be something like: Map<String (rack name), RackInfo> and RackInfo has a Map<String (hostname), List<TaskInProgress> >. Or even more tree-like...

          How about providing get/set APIs to the existing datastructure. The datastructure works for all cases with arbitrary number of levels (host, rack, switch, datacenter,..) (since it is a list of mappings from Node to TIPs). I didn't want to introduce Strings in the mapping since the NetworkTopology provides a Node abstraction for everything. If we went to Strings then we have an additional step of getting the Node from the String name (and vice versa), parsing strings to get to the Node, etc., which can be easily avoided by having the mappings based on Node.

          Did you need to change the definition of findNewTask? I don't see it in the patch.

          Yes, I changed the definition of findNewTask. In the patch, look for the comment "Find a new task to run"; the diff doesn't show the findNewTask line itself, just the comment above it.

          This needs user documentation in forrest.

          I have that in the 1985.v6.patch. Look for cluster_setup.xml and hdfs_design.xml, where I describe how the rack config can be set up. Did you mean something else?

          The java doc on DNSToSwitchMapping.resolve should probably mention that they must cache if their operation is expensive. Although there isn't a way to clear or update that cache, which might be a problem at some point...

          Agreed regarding the documentation on the cache part. The update of the cache could be handled by the implementation of DNSToSwitchMapping, no? I can imagine a case where the implementation starts a thread that periodically contacts some service and updates its cache. This is transparent to clients calling DNSToSwitchMapping.resolve.
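          A memoizing wrapper of the kind being discussed could look like the sketch below. The resolve(List&lt;String&gt;) signature matches the issue's description of DNSToSwitchMapping; the wrapper class itself is illustrative, not part of the patch. The expensive raw lookup runs only for names not seen before, and a background refresher thread, as suggested above, could rebuild the map transparently.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical memoizing wrapper around a DNSToSwitchMapping-shaped
// interface; not the patch's code.
public class CachedMappingSketch {

    interface DNSToSwitchMapping {
        List<String> resolve(List<String> names);
    }

    static class CachedDNSToSwitchMapping implements DNSToSwitchMapping {
        private final DNSToSwitchMapping rawMapping;
        private final Map<String, String> cache = new ConcurrentHashMap<>();

        CachedDNSToSwitchMapping(DNSToSwitchMapping rawMapping) {
            this.rawMapping = rawMapping;
        }

        @Override
        public List<String> resolve(List<String> names) {
            List<String> unresolved = new ArrayList<>();
            for (String name : names) {
                if (!cache.containsKey(name)) unresolved.add(name);
            }
            if (!unresolved.isEmpty()) {
                // The expensive call (script, DNS, remote service) happens
                // only for cache misses.
                List<String> paths = rawMapping.resolve(unresolved);
                for (int i = 0; i < unresolved.size(); i++) {
                    cache.put(unresolved.get(i), paths.get(i));
                }
            }
            List<String> result = new ArrayList<>();
            for (String name : names) result.add(cache.get(name));
            return result;
        }
    }

    public static void main(String[] args) {
        final int[] rawCalls = {0};
        DNSToSwitchMapping raw = names -> {
            rawCalls[0]++;
            List<String> paths = new ArrayList<>();
            for (String ignored : names) paths.add("/default-rack");
            return paths;
        };
        DNSToSwitchMapping cached = new CachedDNSToSwitchMapping(raw);
        cached.resolve(List.of("host1", "host2"));
        cached.resolve(List.of("host1"));                 // served from the cache
        System.out.println("raw calls: " + rawCalls[0]);  // prints "raw calls: 1"
    }
}
```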

          You don't really need the Scan example, you could use the GenericMRLoadGenerator with a -keepmap of 0.

          Okay.

          In the longer term I think a configured mapping class would be useful. A class named org.apache.hadoop.net.ConfiguredNodeMapping that let you set the mapping in your config.

          In the patch this is handled by a specific implementation of the DNSToSwitchMapping called StaticMapping, and that provides an API to set up the mapping from host to rackid (used in testcases). But I think I should be able to set things in the configuration and StaticMapping could initialize itself with the mapping provided there. I'll look at that.

          Owen O'Malley added a comment - edited

          I'm worried about the time and memory performance of this. Have you run a sort with dfs cluster == map/reduce cluster and compared running times and job tracker memory size? We've already seen cases where the current pollForNewTask causes performance problems...

          It bothers me that the max levels is hard coded rather than configurable.

          From a style point of view, I probably would have defined a new class rather than use nested java.utils containers like List<Map<Node, List<TaskInProgress>>>. That way if we change the representation later it won't be scattered through the code. In particular, I can imagine wanting to have the data structure be something like:
          Map<String (rack name), RackInfo> and RackInfo has a Map<String (hostname), List<TaskInProgress> >. Or even more tree-like...

          Did you need to change the definition of findNewTask? I don't see it in the patch.

          This needs user documentation in forrest.

          The java doc on DNSToSwitchMapping.resolve should probably mention that they must cache if their operation is expensive. Although there isn't a way to clear or update that cache, which might be a problem at some point...

          You don't really need the Scan example, you could use the GenericMRLoadGenerator with a -keepmap of 0.

          In the longer term I think a configured mapping class would be useful. A class named
          org.apache.hadoop.net.ConfiguredNodeMapping that let you set the mapping in your config. Something like:

          <property>
             <name>hadoop.configured.node.mapping</name>
             <value>host1=/rack1,host2=/rack1,host3=/rack4</value>
          </property>
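          Initializing from such a value could be sketched as below. Neither the property name nor this class exists in the patch; it just shows how a ConfiguredNodeMapping might parse its host-to-rack table from the "host1=/rack1,host2=/rack1,..." string.

```java
import java.util.*;

// Hypothetical parser for the proposed config value above; illustrative
// only, not part of the patch.
public class ConfiguredNodeMappingSketch {

    static Map<String, String> parse(String value) {
        Map<String, String> hostToRack = new HashMap<>();
        for (String pair : value.split(",")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length == 2) {
                hostToRack.put(kv[0].trim(), kv[1].trim());   // host -> network path
            }
        }
        return hostToRack;
    }

    public static void main(String[] args) {
        Map<String, String> mapping =
                parse("host1=/rack1,host2=/rack1,host3=/rack4");
        System.out.println(mapping.get("host3"));   // prints "/rack4"
    }
}
```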
          
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12373076/1985.v6.patch
          against trunk revision r611760.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1581/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1581/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1581/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1581/console

          This message is automatically generated.

          Devaraj Das made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Devaraj Das made changes -
          Attachment 1985.v6.patch [ 12373076 ]
          Devaraj Das added a comment -

          Updated patch.

          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Devaraj Das added a comment -

          The patch is out-of-sync with the trunk.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12372631/1985.v5.patch
          against trunk revision .

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1503/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1503/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1503/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1503/console

          This message is automatically generated.

          Devaraj Das made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Devaraj Das made changes -
          Attachment 1985.v5.patch [ 12372631 ]
          Devaraj Das added a comment -

          Fixed a problem with NetworkTopology's getNode method.

          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12372415/1985.v4.patch
          against trunk revision .

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1454/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1454/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1454/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1454/console

          This message is automatically generated.

          Devaraj Das made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Devaraj Das added a comment -

          Cancelling patch to get the latest patch through Hudson.

          Devaraj Das made changes -
          Component/s mapred [ 12310690 ]
          Component/s dfs [ 12310710 ]
          Devaraj Das made changes -
          Attachment 1985.v4.patch [ 12372415 ]
          Devaraj Das added a comment -

          HADOOP-2344 made this patch go out-of-sync. This is the updated one.

          Devaraj Das added a comment -

          I ran the Scan benchmark attached in the patch (the benchmark just scans inputs; no sort/shuffle/reduce).

          The input data was generated on a cluster of ~300 machines. RandomWriter was run with the following config: 40 maps per host, each map configured to generate 1 GB, DFS block size 256 MB. The input data set was thus around 11.6 TB.

          Another cluster of ~900 nodes, with its dfs pointing to the earlier 300 node cluster, was used to run the Scan benchmark. The number of maps was equal to the number of dfs blocks in the input.

          The two clusters had common racks but no common nodes. With the rack-aware patch, the scan program took 25 minutes (with 90% rack-local tasks); without the patch, it took around 35 minutes, a ~30% improvement.
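          As a back-of-envelope check of those numbers (assuming ~300 means exactly 300 hosts and "1G" means one 1 GiB file per map):

```java
// Back-of-envelope check of the benchmark numbers above; the exact host
// count and per-map output size are assumptions, since the comment only
// gives approximate figures.
public class ScanBenchmarkNumbers {
    public static void main(String[] args) {
        double totalTiB = 300 * 40 * 1.0 / 1024.0;            // 12000 GiB, ~11.7 TiB
        double improvementPct = (35.0 - 25.0) / 35.0 * 100.0; // ~28.6%, i.e. roughly 30%
        System.out.printf("%.1f TiB, %.1f%% faster%n", totalTiB, improvementPct);
    }
}
```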

          Devaraj Das made changes -
          Attachment 1985.v3.patch [ 12372151 ]
          Devaraj Das added a comment -

          Fixed an issue in the Scan benchmark.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12371728/1985.v2.patch
          against trunk revision r604451.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1356/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1356/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1356/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1356/console

          This message is automatically generated.

          Devaraj Das made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Devaraj Das added a comment -

          Rerunning through hudson.

          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Devaraj Das made changes -
          Attachment 1985.v2.patch [ 12371728 ]
          Devaraj Das added a comment -

          Fixed the findbugs issues.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12371666/1985.v1.patch
          against trunk revision r604058.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs -1. The patch appears to introduce 2 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1344/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1344/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1344/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1344/console

          This message is automatically generated.

          Devaraj Das made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Devaraj Das added a comment -

          Passing it through hudson.

          Devaraj Das made changes -
          Fix Version/s 0.16.0 [ 12312740 ]
          Devaraj Das made changes -
          Attachment 1985.v1.patch [ 12371666 ]
          Devaraj Das added a comment -

          Attached is a tested patch. The patch has better documentation too. One of the important changes in the patch, related to testcases, is the way it handles multiple datanodes on the same machine. Since the namenode should be able to distinguish between them in terms of the dnsToRackId mapping, I added a configuration option called "slave.host.name" that takes effect only when the framework is run under junit. The same applies to the jobtracker/tasktrackers. Also, all communications to these dummy hostnames are redirected to "localhost" (indirectly via NetUtils.createSocketAddress).

          Patch is up for review.

          Devaraj Das made changes -
          Attachment 1985.new.patch [ 12369805 ]
          Devaraj Das added a comment -

          Attached is one version of the patch. Hasn't been tested on large clusters yet. Here are the main changes:
          1) DFS part updated to use the newly defined DNSToSwitchMapping interface.
          1.1) The datanode doesn't send the switch info as part of registration; rather, the namenode gets that info through DNSToSwitchMapping.resolve.
          2) The default implementation of the DNSToSwitchMapping assumes a script-based resolution (ScriptBasedMapping). If the script is not defined, then DEFAULT_RACK is assumed.
          3) The JobTracker maintains the network topology and updates it (if required) whenever a tasktracker sends a heartbeat.
          4) The JobInProgress maintains a compile-time-configurable number of task caches. For example, the first-level cache is the map of leaf-level Nodes to TIPs, the second level is the map of rack-level nodes to TIPs in that rack, the third level is for the level above rack, and so on. The default number of caches here is hardcoded to 2.
          5) At runtime, the findNewTask would use these caches to assign a task to a checked-in tasktracker.
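A rough sketch of the multi-level cache lookup described in (4) and (5) — the class and method names here (TaskCache, add) are illustrative, not taken from the patch. The lookup walks the tracker's network path from the most local level outward:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (not the actual JobInProgress code) of looking up
// runnable TIPs level by level: host first, then rack, then above rack.
public class TaskCache {
    // caches.get(0): leaf (host) node -> TIPs,
    // caches.get(1): rack node -> TIPs, and so on.
    private final List<Map<String, List<String>>> caches;

    public TaskCache(int levels) {
        caches = new ArrayList<>();
        for (int i = 0; i < levels; i++) {
            caches.add(new HashMap<>());
        }
    }

    // Register a task-in-progress under a node at the given cache level.
    public void add(int level, String node, String tip) {
        caches.get(level).computeIfAbsent(node, k -> new ArrayList<>()).add(tip);
    }

    // Walk up the tracker's network path, most local level first.
    // path[0] is the host, path[1] its rack, and so on.
    public String findNewTask(String[] path) {
        for (int level = 0; level < Math.min(path.length, caches.size()); level++) {
            List<String> tips = caches.get(level).get(path[level]);
            if (tips != null && !tips.isEmpty()) {
                return tips.remove(0); // most local runnable task wins
            }
        }
        return null; // no cached local task; fall back to any runnable TIP
    }
}
```

With two levels (the hardcoded default above), a checked-in tracker first gets a task whose input lives on its own host, then one on its rack.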

          Patch up for review.

          Hairong Kuang added a comment -

          Currently in dfs a datanode can get its network location either from the command line or by running a pluggable script at startup time. The property is defined in the default configuration file as below.

          <property>
            <name>dfs.network.script</name>
            <value></value>
            <description>
              Specifies a script name that prints the network location path
              of the current machine.
            </description>
          </property>
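A hypothetical sketch (not the actual Hadoop code) of how such a script could be invoked: the resolver runs the configured script with the host name as an argument and reads the single printed path, falling back to a default rack when no script is set or the script fails. The class name ScriptResolver and the fallback constant are assumptions for illustration:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Illustrative resolver: runs a configured script (e.g. the value of
// dfs.network.script) with the host name as argument and reads the
// network location path the script prints on its first output line.
public class ScriptResolver {
    static final String DEFAULT_RACK = "/default-rack";
    private final String scriptName;

    public ScriptResolver(String scriptName) {
        this.scriptName = scriptName;
    }

    public String resolve(String host) {
        if (scriptName == null || scriptName.isEmpty()) {
            return DEFAULT_RACK; // no script configured
        }
        try {
            Process p = new ProcessBuilder(scriptName, host).start();
            BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
            String path = r.readLine(); // script prints e.g. /foo/rack1
            p.waitFor();
            return (path == null || path.trim().isEmpty())
                ? DEFAULT_RACK : path.trim();
        } catch (Exception e) {
            return DEFAULT_RACK; // fall back on any failure
        }
    }
}
```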

          eric baldeschwieler added a comment -

          I agree that an exec, as a simple-to-configure option, should be required.

          Allen Wittenauer added a comment -

          Just a few notes.

          From an ops perspective, it is important that this mapping be highly pluggable in an easy way. The ability to have hadoop call some sort of executable (not necessarily a script) means we can do fancy things with /etc/netmasks or LDAP lookups or ... . Ideally, every sort of mapping would have a callout rather than having one big one. KISS is important here. [Remember, most admins--myself included--are not hardcore Java people. ]

          FWIW, most implementations of autofs include similar functionality called executable maps where the key is passed to an exec and the exec returns the location of the mount. So the practice has at least a little bit of traction. [In fact, auto.net aka /net on Linux uses this method.]

          Additionally, I think moving this functionality to the namenode makes this significantly easier to manage as a grid scales up. There is also the issue of whether the namenode should 'trust' the datanode to report the proper location. I understand that the datanode and namenode have to trust each other at some point during node bringup, but I think it makes a lot of sense to let the namenode be in charge of data locality.

          Hopefully this was helpful.

          eric baldeschwieler added a comment -

          works for me

          Devaraj Das added a comment - - edited

          Some thoughts -
          1) Make the DNS->Switch mapping an interface class.
          1.1) The interface would look like:

              interface DNStoSwitchMap {
                public String resolve(String dnsname);
              }

          1.2) The switch string format is the same as it exists today (documented in https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf). That will make things work in the non-typical setup with 3+ levels of nodes.
          1.3) The default implementation of the interface, packaged with hadoop, could simply look up a table of dns->switch mapping created statically.

          2) The DataNode, today, takes the location as an argument. This is not needed anymore, and hence the associated code would go away.
          3) The DataNode sends the location information as part of the registration. The NetworkTopology is derived at the NameNode. Using the interface mentioned in (1), the NameNode can create the topology all by itself.

          4) The JobTracker creates the NetworkTopology for the TaskTrackers exactly how the NameNode does it.
          5) The JobTracker assigns tasks first on node-locality basis, then on rack-locality basis.
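A minimal sketch of (1.1) and (1.3) together, assuming a hardcoded table for illustration only (the committed patch's interface, DNSToSwitchMapping, ended up resolving a List<String> of names rather than one name at a time):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed single-name interface from (1.1).
interface DNStoSwitchMap {
    String resolve(String dnsname);
}

// Sketch of the static-table default from (1.3): a fixed
// host -> switch-path lookup, falling back to a default rack.
// The entries and class name are illustrative, not from the patch;
// a real table would be loaded from a configuration file.
class StaticTableMapping implements DNStoSwitchMap {
    private final Map<String, String> table = new HashMap<>();

    StaticTableMapping() {
        table.put("node1.example.com", "/core-switch-1/rack1");
        table.put("node2.example.com", "/core-switch-1/rack2");
    }

    @Override
    public String resolve(String dnsname) {
        return table.getOrDefault(dnsname, "/default-rack");
    }
}
```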

          In our environment, task placement based on "distance" (o.a.h.n.NetworkTopology.getDistance) isn't particularly relevant, since we only have flat racks of machines. But we might make the framework ready for it as well (assuming it is not too much work).

          Does the above make sense?

          Devaraj Das made changes -
          Field Original Value New Value
          Assignee Devaraj Das [ devaraj ]
          Runping Qi added a comment -

          Yes, the DNS name (hostname) to switch id mapping should be managed just like the hostname to IP mapping.
          The info should be available to the DFS namenode, datanodes and applications in the same way.
          Job tracker uses this info for task assignment. In general, DFS client should also use this info to decide where to fetch a needed block.

          eric baldeschwieler created issue -

            People

            • Assignee: Devaraj Das
            • Reporter: eric baldeschwieler
            • Votes: 0
            • Watchers: 5