Hadoop Common
  1. Hadoop Common
  2. HADOOP-5759

IllegalArgumentException when CombineFileInputFormat is used as job InputFormat

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1, 0.20.205.0
    • Fix Version/s: 0.20.2
    • Component/s: None
    • Labels:
      None

      Description

      As per my understanding, CombineFileInputFormat is creating splits with rackname as split location.
      When I use CombineFileInputFormat as the InputFormat for job, job initialization fails with following exception :
      2009-04-28 14:10:40,162 ERROR mapred.EagerTaskInitializationListener (EagerTaskInitializationListener.java:run(83)) - Job initialization failed:
      java.lang.IllegalArgumentException: Network location name contains /: /default-rack
      at org.apache.hadoop.net.NodeBase.set(NodeBase.java:76)
      at org.apache.hadoop.net.NodeBase.<init>(NodeBase.java:57)
      at org.apache.hadoop.mapred.JobTracker.addHostToNodeMapping(JobTracker.java:2342)
      at org.apache.hadoop.mapred.JobTracker.resolveAndAddToTopology(JobTracker.java:2336)
      at org.apache.hadoop.mapred.JobInProgress.createCache(JobInProgress.java:344)
      at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:441)
      at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:81)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
      at java.lang.Thread.run(Thread.java:619)

      When I changed CombineFileInputFormat to pass just rackname (without '/'), JT wrongly resolves the node as /default-rack/<rack-name>.

      Solution is to pass hostnames holding the block(on the rack), instead of rackname.

      1. patch-5759-ydist.txt
        12 kB
        Amareshwari Sriramadasu
      2. patch-5759-2.txt
        11 kB
        Amareshwari Sriramadasu
      3. patch-5759-1.txt
        11 kB
        Amareshwari Sriramadasu
      4. patch-5759.txt
        13 kB
        Amareshwari Sriramadasu

        Issue Links

          Activity

          Hide
          Amareshwari Sriramadasu added a comment -

          Patch changing racknames to hostnames.
          Unit test to do "submit a job with CombineFileInputFormat "will added in HADOOP-5698. Now, I manually tested the patch by running multifile wordcount using CombineFileInputFormat

          Show
          Amareshwari Sriramadasu added a comment - Patch changing racknames to hostnames. Unit test to do "submit a job with CombineFileInputFormat "will added in HADOOP-5698 . Now, I manually tested the patch by running multifile wordcount using CombineFileInputFormat
          Hide
          Jothi Padmanabhan added a comment -

          The patch does not apply cleanly because of the change to the test directory structure (src/test/org.. should be src/test/mapred/org/...)

          Show
          Jothi Padmanabhan added a comment - The patch does not apply cleanly because of the change to the test directory structure (src/test/org.. should be src/test/mapred/org/...)
          Hide
          Jothi Padmanabhan added a comment -

          Since only hosts that actually contain the valid blocks are returned in getMoreSplits with this patch, it appears that we are losing the rack level aggregation that was intended for this InputFormat.

          Dhruba, could you take a look as well?

          Show
          Jothi Padmanabhan added a comment - Since only hosts that actually contain the valid blocks are returned in getMoreSplits with this patch, it appears that we are losing the rack level aggregation that was intended for this InputFormat. Dhruba, could you take a look as well?
          Hide
          dhruba borthakur added a comment -

          This patch keeps the original aim of combining blocks from different hosts in the same rack into a single split. The fix that that patch attempts is to figure out where such a combined split should reside.

          > Since only hosts that actually contain the valid blocks are returned in getMoreSplits with this patch,

          I agree with Jothi to a certain extent. The number ofsplits remain the same before, but the possibility of scheduling them on the rack where they reside is slightly reduced because we look only at those hosts where this block belongs. It is possible to enhance this patch to create a new data strcture called rackToNodes at the very beginning. It can be populated by iterating through all the blocks at the very beginning.

          Show
          dhruba borthakur added a comment - This patch keeps the original aim of combining blocks from different hosts in the same rack into a single split. The fix that that patch attempts is to figure out where such a combined split should reside. > Since only hosts that actually contain the valid blocks are returned in getMoreSplits with this patch, I agree with Jothi to a certain extent. The number ofsplits remain the same before, but the possibility of scheduling them on the rack where they reside is slightly reduced because we look only at those hosts where this block belongs. It is possible to enhance this patch to create a new data strcture called rackToNodes at the very beginning. It can be populated by iterating through all the blocks at the very beginning.
          Hide
          Jothi Padmanabhan added a comment -

          It is possible to enhance this patch to create a new data strcture called rackToNodes at the very beginning. It can be populated by iterating through all the blocks at the very beginning.

          Agreed. However, Amareshwari and I discussed this offline and we are not sure if we want to build this rackToNodes before getMoreSplits method as it would mean calling getBlockLocations for all the blocks twice – once to build the rackToNodes and once in getMoreSplits to build the blockInfo maps. Could we incrementally build this map in getMoreSplits. This would have the disadvantage of having incomplete information for the first few splits.

          Show
          Jothi Padmanabhan added a comment - It is possible to enhance this patch to create a new data strcture called rackToNodes at the very beginning. It can be populated by iterating through all the blocks at the very beginning. Agreed. However, Amareshwari and I discussed this offline and we are not sure if we want to build this rackToNodes before getMoreSplits method as it would mean calling getBlockLocations for all the blocks twice – once to build the rackToNodes and once in getMoreSplits to build the blockInfo maps. Could we incrementally build this map in getMoreSplits. This would have the disadvantage of having incomplete information for the first few splits.
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch adding rackToNodes map incrementally.

          Show
          Amareshwari Sriramadasu added a comment - Patch adding rackToNodes map incrementally.
          Hide
          Amareshwari Sriramadasu added a comment -

          test-patch result:

               [exec]
               [exec] +1 overall.
               [exec]
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec]
               [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
               [exec]
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec]
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec]
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec]
               [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
               [exec]
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
               [exec]
          
          Show
          Amareshwari Sriramadasu added a comment - test-patch result: [exec] [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec]
          Hide
          dhruba borthakur added a comment -

          Code looks good to me. A few minor comments:

          1. Is the change to addCreatedSplit() not needed anymore?
          2. Since we added a static data structure rackToNodes, it might make sense to empty it just before returning from getSplits. This will ensure that we free up that memory sooner rather than later.

          Show
          dhruba borthakur added a comment - Code looks good to me. A few minor comments: 1. Is the change to addCreatedSplit() not needed anymore? 2. Since we added a static data structure rackToNodes, it might make sense to empty it just before returning from getSplits. This will ensure that we free up that memory sooner rather than later.
          Hide
          Amareshwari Sriramadasu added a comment -

          1. Is the change to addCreatedSplit() not needed anymore?

          @@ -412,7 +417,7 @@
              */
             private void addCreatedSplit(JobConf job,
                                          List<CombineFileSplit> splitList, 
          -                               List<String> racks, 
          +                               List<String> locations, 
                                          ArrayList<OneBlockInfo> validBlocks) {
               // create an input split
          

          I changed the above, because racks are not passed to addCreatedSplit any more.

          Added rackToNodes.clear() just before returning splits from getSplits, as suggested by Dhruba.

          Show
          Amareshwari Sriramadasu added a comment - 1. Is the change to addCreatedSplit() not needed anymore? @@ -412,7 +417,7 @@ */ private void addCreatedSplit(JobConf job, List<CombineFileSplit> splitList, - List< String > racks, + List< String > locations, ArrayList<OneBlockInfo> validBlocks) { // create an input split I changed the above, because racks are not passed to addCreatedSplit any more. Added rackToNodes.clear() just before returning splits from getSplits, as suggested by Dhruba.
          Hide
          dhruba borthakur added a comment -

          +1 Code looks good.

          Show
          dhruba borthakur added a comment - +1 Code looks good.
          Hide
          Amareshwari Sriramadasu added a comment -

          test-patch and ant test passed on my machine.

          Show
          Amareshwari Sriramadasu added a comment - test-patch and ant test passed on my machine.
          Hide
          Amareshwari Sriramadasu added a comment -

          test-patch :

               [exec] +1 overall.
               [exec]
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec]
               [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
               [exec]
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec]
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec]
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec]
               [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
               [exec]
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
               [exec]
          
          Show
          Amareshwari Sriramadasu added a comment - test-patch : [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec]
          Hide
          dhruba borthakur added a comment -

          I just committed this. Thanks Amareshwari!

          Show
          dhruba borthakur added a comment - I just committed this. Thanks Amareshwari!
          Hide
          Zheng Shao added a comment -

          Committed to branch-0.20.

          Show
          Zheng Shao added a comment - Committed to branch-0.20.
          Hide
          Gaurav Jain added a comment -

          I am trying to use CombineFile* classes to use in Zebra. After this patch, it works for me, so far.

          However, I was wondering if somebody has done any performance analysis on CombineFile* vs InputFile* ( data locality vs ( non ) data-locality )

          If so, can I have access to it?

          Show
          Gaurav Jain added a comment - I am trying to use CombineFile* classes to use in Zebra. After this patch, it works for me, so far. However, I was wondering if somebody has done any performance analysis on CombineFile* vs InputFile* ( data locality vs ( non ) data-locality ) If so, can I have access to it?
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch for Yahoo! distribution

          Show
          Amareshwari Sriramadasu added a comment - Patch for Yahoo! distribution
          Hide
          Joep Rottinghuis added a comment -

          The "patch-5759-ydist.txt" patch should be applied to branch-0.20-security and the 0.20.205 branch.

          Show
          Joep Rottinghuis added a comment - The "patch-5759-ydist.txt" patch should be applied to branch-0.20-security and the 0.20.205 branch.

            People

            • Assignee:
              Amareshwari Sriramadasu
              Reporter:
              Amareshwari Sriramadasu
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development