Hadoop Common
HADOOP-3293

When an input split spans across a block boundary, the split location should be the host holding the most bytes of that split.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    1. hadoop-3293.patch (13 kB, Jothi Padmanabhan)

      Issue Links

        Activity

        Jothi Padmanabhan added a comment -

        The reason for this is that FileInputFormat.getBlockIndex() returns the block index of the starting block for the given offset. Instead, it should identify all the blocks that this particular split spans and then choose the block that contributes the most data to this split.

        We could use the following approach

        // Calculate the number of blocks the split spans
        if (numBlocks == 1)
          return startIndex;
        else if (numBlocks == 2)
          return (bytesInFirstBlock > bytesInSecondBlock) ? startIndex : startIndex + 1;
        else
          return startIndex + 1;
        

        The rationale here is that if there are more than two blocks, we are guaranteed that block 2 is contributing its entire block length for this split.
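        To make the sketch above concrete, here is a self-contained version of that logic. The method and variable names, and the assumption of a fixed block size, are illustrative only; this is not the actual patch code.

```java
public class SplitBlockIndex {

    // Illustrative sketch: pick the index of the block holding the most
    // bytes of a split, assuming equal-sized blocks.
    static int chooseBlockIndex(long splitStart, long splitLength, long blockSize) {
        // Calculate the number of blocks the split spans.
        int startIndex = (int) (splitStart / blockSize);
        int endIndex = (int) ((splitStart + splitLength - 1) / blockSize);
        int numBlocks = endIndex - startIndex + 1;

        if (numBlocks == 1) {
            return startIndex;
        }
        long bytesInFirstBlock = (startIndex + 1) * blockSize - splitStart;
        if (numBlocks == 2) {
            long bytesInSecondBlock = splitLength - bytesInFirstBlock;
            return (bytesInFirstBlock > bytesInSecondBlock) ? startIndex : startIndex + 1;
        }
        // Three or more blocks: the second block contributes a full
        // blockSize, which no other block can exceed.
        return startIndex + 1;
    }
}
```

        For example, with a block size of 100, a split starting at offset 80 with length 30 has 20 bytes in the first block and 10 in the second, so the first block's index is returned.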

        Note that we cannot identify the block index based on the amount of data contributed by each individual host, because of the replication factor.
        For example (assume a DFS block size of 100):
        Block 1 contributes 20 bytes; its hosts are A, B, C
        Block 2 contributes 100 bytes; its hosts are A, D, E
        Block 3 contributes 10 bytes; its hosts are D, E, F

        If we aggregate on a per-host basis, host A, having contributed 120 bytes, would be the ideal choice. However, if we choose Block 1 as the index to be returned, hosts B and C would also be treated as data-local, which is suboptimal.
        Thoughts?

        Jothi Padmanabhan added a comment -

        If we aggregate on a per-host basis, host A, having contributed 120 bytes, would be the ideal choice. However, if we choose Block 1 as the index to be returned, hosts B and C would also be treated as data-local, which is suboptimal.

        To make this clear: having decided that A is a good host, we also need a good way to pick the correct block from the list of blocks that reside on A. In this case, we should choose between Block 1 and Block 2. If Block 1 is chosen, it is not optimal, as hosts B and C hold only 20 bytes of the split.

        Jothi Padmanabhan added a comment -

        Since blkIndex is used only to identify the hosts,

                int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining,
                                             splitSize);
                splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                                         blkLocations[blkIndex].getHosts()));
        

        we could also modify getBlockIndex() to return a list of the hosts that contain the maximum data for that split. For example, if the split spanned:
        Block 1: 80 bytes, hosts A, B, C
        Block 2: 100 bytes, hosts A, D, E
        Block 3: 70 bytes, hosts D, F, B

        We would identify the hosts and their contributions as:
        A: 180
        B: 150
        C: 80
        D: 170
        E: 100
        F: 70

        We could return A,D,B
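        As a sketch, the per-host aggregation described here could look like the following. The class and method names are hypothetical, not the code in the attached patch.

```java
import java.util.*;
import java.util.stream.*;

public class SplitHostAggregator {

    // Hypothetical sketch: sum each host's byte contribution across the
    // blocks a split spans, then return the k hosts holding the most bytes.
    static List<String> topHosts(long[] blockBytes, String[][] blockHosts, int k) {
        Map<String, Long> contrib = new HashMap<>();
        for (int i = 0; i < blockBytes.length; i++) {
            for (String host : blockHosts[i]) {
                contrib.merge(host, blockBytes[i], Long::sum);
            }
        }
        return contrib.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

        With the three blocks above (80, 100, and 70 bytes), topHosts returns A, D, B, matching the table.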

        Runping Qi added a comment -

        In the above case, I'd say that the preferred hosts for the split should be in the order A, D, B, E, C, F.
        We should also aggregate the bytes over the racks of those hosts.
        For example, suppose C, E, and F share the same rack while the other nodes are on different racks.
        Then host E (and F, and even C) will offer better rack locality than the other hosts.
        In practice, rack locality is almost as good as node locality.

        Jothi Padmanabhan added a comment -

        Runping's point on aggregating bytes over racks to determine rack locality makes sense.

        The problem is that the JobClient is unaware of the topology. Some ways to build the topology awareness are:

        1. Make the JobClient query the topology service and build its own topology awareness. The problem with this approach is that we need to ensure that the topology scripts used by the JobClient and the JobTracker are always in sync.
        2. Let the client get back rack information along with the hosts when it queries the FS for block locations (fs.getFileBlockLocations). We could add a new method, fs.getResolvedFileBlockLocations, that returns the hosts with the rack information. The default implementation would just return the hosts; DFS would override this method and return the rack information along with the hosts. We are guaranteed correct topology information because the Namenode and JobTracker would be using the same topology information.

        The second approach looks better. Thoughts?
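        A rough sketch of the second approach, using simplified stand-in types. BaseFileSystem, Dfs, and RackedBlockLocation are illustrative names only, not the real Hadoop classes.

```java
// Simplified stand-in types for illustration only; not the real Hadoop API.
class BlockLocation {
    final String[] hosts;
    BlockLocation(String[] hosts) { this.hosts = hosts; }
    String[] getRacks() { return new String[0]; }  // no topology info by default
}

class RackedBlockLocation extends BlockLocation {
    final String[] racks;  // rack (or full topology path) of each host
    RackedBlockLocation(String[] hosts, String[] racks) {
        super(hosts);
        this.racks = racks;
    }
    @Override String[] getRacks() { return racks; }
}

class BaseFileSystem {
    // Default implementation: just the hosts, no rack information.
    BlockLocation[] getResolvedFileBlockLocations(String path) {
        return new BlockLocation[] { new BlockLocation(new String[]{"hostA"}) };
    }
}

class Dfs extends BaseFileSystem {
    // DFS override: the Namenode knows the topology, so racks come back too.
    @Override
    BlockLocation[] getResolvedFileBlockLocations(String path) {
        return new BlockLocation[] {
            new RackedBlockLocation(new String[]{"hostA"}, new String[]{"/rack1"})
        };
    }
}
```

        The point of the pattern is that callers keep a single entry point, while DFS-backed filesystems enrich the result with topology data.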

        Jothi Padmanabhan added a comment -

        Runping, just to make sure I understand this correctly, consider the following hypothetical scenario:

        Rack 1 has hosts A,B,C
        Rack 2 has host D

        A contributes 25, B contributes 20, C contributes 15, D contributes 40

        Then the preference order of hosts would be A, B, C, D, as A, B, and C have an intra-rack contribution of 60 and an external contribution of 40, whereas D has an intra-rack contribution of 40 and an external contribution of 60.
        Is this correct?
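        That rule can be sketched as sorting hosts first by their rack's aggregate contribution and then by their own. Again, the names here are illustrative, not the patch code.

```java
import java.util.*;

public class RackAwareHostOrder {

    // Illustrative sketch: order hosts by their rack's total byte
    // contribution, breaking ties within a rack by each host's own bytes.
    static List<String> order(Map<String, Long> hostBytes, Map<String, String> hostRack) {
        Map<String, Long> rackBytes = new HashMap<>();
        hostBytes.forEach((host, bytes) ->
                rackBytes.merge(hostRack.get(host), bytes, Long::sum));

        Comparator<String> byRack = Comparator.comparingLong(h -> rackBytes.get(hostRack.get(h)));
        Comparator<String> byOwn = Comparator.comparingLong(hostBytes::get);

        List<String> hosts = new ArrayList<>(hostBytes.keySet());
        hosts.sort(byRack.reversed().thenComparing(byOwn.reversed()));
        return hosts;
    }
}
```

        For the scenario above (A: 25, B: 20, C: 15 on rack 1; D: 40 on rack 2), rack 1's total of 60 beats rack 2's 40, giving the order A, B, C, D.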

        Runping Qi added a comment -

        yeh.

        dhruba borthakur added a comment -

        > DFS would override this method and will return the rack information along with the hosts.

        This is a good idea, but returning only rack location might not work in the general case when there are more than 2 levels in the network topology. Knowing the name of a rack might not tell you how close it is to another rack. But getFileBlockLocations could return the complete path of the host in the network topology. I will provide a patch for this one. See HADOOP-4567

        Jothi Padmanabhan added a comment -

        Initial Patch for review

        Tsz Wo Nicholas Sze added a comment -

        > Initial Patch for review

        For an initial patch, you might not want to submit it, since the patch will likely need to be updated. Also, you might want to test it locally before submitting it.

        Jothi Padmanabhan added a comment -

        Sorry, I should have said "Patch for review"; the patch was tested locally.
        I also ran a test to demonstrate the performance improvement from the patch. I allocated a 440-node cluster and ran randomwriter with 40 maps, each map outputting 25 GB. I then killed the task trackers on the nodes that ran the maps, and ran a modified sort (no map output, no reduces) with a minimum input split size of 10 GB. I found that, averaged over three runs, the patch was about 17 seconds faster than trunk (175 seconds as opposed to 192 seconds).

        Tsz Wo Nicholas Sze added a comment -

        > the Patch was locally tested.
        Have you run all the unit tests with "ant test"? It seems that the patch has some problems and is currently stuck in Hudson. See http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3593/

        Jothi Padmanabhan added a comment -

        Yes, I did.

        Tsz Wo Nicholas Sze added a comment -

        I see. There must be something wrong with the Hudson machine. It seems to have been having a lot of problems recently.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12393879/hadoop-3293.patch
        against trunk revision 714107.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3594/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3594/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3594/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3594/console

        This message is automatically generated.

        Devaraj Das added a comment -

        I just committed this. Thanks, Jothi!


          People

          • Assignee: Jothi Padmanabhan
          • Reporter: Runping Qi
          • Votes: 0
          • Watchers: 4
