Hadoop Common
  1. Hadoop Common
  2. HADOOP-4567

GetFileBlockLocations should return the NetworkTopology information of the machines that hosts those blocks

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Changed GetFileBlockLocations to return topology information for nodes that host the block replicas.

      Description

      MultiFileInputFormat and FileInputFormat should use block locality information to construct splits.

      1. dfsRackLocation.patch
        5 kB
        dhruba borthakur
      2. dfsRackLocation2.patch
        6 kB
        dhruba borthakur
      3. dfsRackLocation3.patch
        6 kB
        dhruba borthakur

        Issue Links

          Activity

          Hide
          Robert Chansler added a comment -

          Edit release note for publication.

          Show
          Robert Chansler added a comment - Edit release note for publication.
          Hide
          dhruba borthakur added a comment -

          I just committed this.

          Show
          dhruba borthakur added a comment - I just committed this.
          Hide
          dhruba borthakur added a comment -

          GetFileBlockLocations returns the NetworkTopology information of the machines on where the blocks reside.

          Show
          dhruba borthakur added a comment - GetFileBlockLocations returns the NetworkTopology information of the machines on where the blocks reside.
          Hide
          Jothi Padmanabhan added a comment -

          +1. Looks good.

          Show
          Jothi Padmanabhan added a comment - +1. Looks good.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12393321/dfsRackLocation3.patch
          against trunk revision 711350.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12393321/dfsRackLocation3.patch against trunk revision 711350. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/console This message is automatically generated.
          Hide
          dhruba borthakur added a comment -

          Thanks for catching the debug print. I have been making this mistake too often in the last few days

          Show
          dhruba borthakur added a comment - Thanks for catching the debug print. I have been making this mistake too often in the last few days
          Hide
          Jothi Padmanabhan added a comment -

          Patch looks good. Is the

           System.out.println("XXX" +  topologyPaths[j]); 

          in TestReplication.java intentional or did it just slip in

          Show
          Jothi Padmanabhan added a comment - Patch looks good. Is the System .out.println( "XXX" + topologyPaths[j]); in TestReplication.java intentional or did it just slip in
          Hide
          Jothi Padmanabhan added a comment -

          GetTopologyPaths returning /r1/h1 looks good to me as well.

          Show
          Jothi Padmanabhan added a comment - GetTopologyPaths returning /r1/h1 looks good to me as well.
          Hide
          dhruba borthakur added a comment -

          BlockLocation.getTopologyPath() returns the full network location of each of the hosts. Th last component of each path is the host:port.

          Show
          dhruba borthakur added a comment - BlockLocation.getTopologyPath() returns the full network location of each of the hosts. Th last component of each path is the host:port.
          Hide
          dhruba borthakur added a comment -

          Got it. Actually the queston you are raising is whether BlockLocation.getTopologyPaths() should return /r1 or should it return /r1/h1. I would prefer that this new method return /r1/h1. If somebody wants only the hostname, they can continue to use the existing method BlockLocation.getHostName().

          Show
          dhruba borthakur added a comment - Got it. Actually the queston you are raising is whether BlockLocation.getTopologyPaths() should return /r1 or should it return /r1/h1. I would prefer that this new method return /r1/h1. If somebody wants only the hostname, they can continue to use the existing method BlockLocation.getHostName().
          Hide
          Jothi Padmanabhan added a comment -

          Sorry, let me try to explain it better. Let us assume that the replication factor is 3 and we have one block. GetFileBlockLocation would return in BlockLocations[0].hosts

          {h1, h2, h3}

          and in BlockLocations[0].racks

          {r1,r1,r2}

          where h1 and h2 are in r1 and h3 is in r2. The topologyPaths for the different hosts are determined by their relative positions in their arrays. The topologyPath for h1, which is in index 0 of hosts array, is r1 (index 0 in the racks array). An alternative could have just been to return /r1/h1, /r1/h2 and /r2/h3 in the hosts. That way, if somebody wants both the host and rack information, they do not need to construct it by reading the same index in two arrays. Makes sense?

          Show
          Jothi Padmanabhan added a comment - Sorry, let me try to explain it better. Let us assume that the replication factor is 3 and we have one block. GetFileBlockLocation would return in BlockLocations [0] .hosts {h1, h2, h3} and in BlockLocations [0] .racks {r1,r1,r2} where h1 and h2 are in r1 and h3 is in r2. The topologyPaths for the different hosts are determined by their relative positions in their arrays. The topologyPath for h1, which is in index 0 of hosts array, is r1 (index 0 in the racks array). An alternative could have just been to return /r1/h1, /r1/h2 and /r2/h3 in the hosts. That way, if somebody wants both the host and rack information, they do not need to construct it by reading the same index in two arrays. Makes sense?
          Hide
          dhruba borthakur added a comment -

          Thanks for the review comments. I made it a separate variable so that the API clearly identifies the hostname and the network topology.

          >Having a separate racks variable implies that there is an implicit requirement that the the ordering of hosts and the racks are identical (which is true in this case

          I missed your point. Can you pl explain it in greater detal? Thanks.

          Show
          dhruba borthakur added a comment - Thanks for the review comments. I made it a separate variable so that the API clearly identifies the hostname and the network topology. >Having a separate racks variable implies that there is an implicit requirement that the the ordering of hosts and the racks are identical (which is true in this case I missed your point. Can you pl explain it in greater detal? Thanks.
          Hide
          Jothi Padmanabhan added a comment -

          This looks good. The only debatable point is whether to introduce a new 'racks' variable in BlockLocations or to just prefix the network topology in the hosts variable itself. Having a separate racks variable implies that there is an implicit requirement that the the ordering of hosts and the racks are identical (which is true in this case). However, having a separate racks variable does makes it easier to handle cases where there is no topology information available, just a simple check on the racks variable would do instead of adding logic during the parsing of host names. I am fine with either approach.

          Show
          Jothi Padmanabhan added a comment - This looks good. The only debatable point is whether to introduce a new 'racks' variable in BlockLocations or to just prefix the network topology in the hosts variable itself. Having a separate racks variable implies that there is an implicit requirement that the the ordering of hosts and the racks are identical (which is true in this case). However, having a separate racks variable does makes it easier to handle cases where there is no topology information available, just a simple check on the racks variable would do instead of adding logic during the parsing of host names. I am fine with either approach.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12393204/dfsRackLocation.patch
          against trunk revision 709609.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12393204/dfsRackLocation.patch against trunk revision 709609. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/console This message is automatically generated.
          Hide
          dhruba borthakur added a comment -

          Review comments welcome.

          Show
          dhruba borthakur added a comment - Review comments welcome.
          Hide
          dhruba borthakur added a comment -

          FileSystem.getFileBlockLocation also returns the rack location of the blocks that are returned.

          Show
          dhruba borthakur added a comment - FileSystem.getFileBlockLocation also returns the rack location of the blocks that are returned.

            People

            • Assignee:
              dhruba borthakur
              Reporter:
              dhruba borthakur
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development