Hadoop Map/Reduce
MAPREDUCE-1743

conf.get("map.input.file") returns null when using MultipleInputs in Hadoop 0.20

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.20.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      There is a problem getting the input file name in the mapper when using MultipleInputs in Hadoop 0.20. I need to use MultipleInputs to support different formats for the inputs to my MapReduce job. Inside each mapper, I also need to know the exact input file that the mapper is processing. However, conf.get("map.input.file") returns null. Can anybody help me solve this problem? Thanks in advance.

      public class Test extends Configured implements Tool {

        static class InnerMapper extends MapReduceBase
            implements Mapper<Writable, Writable, NullWritable, Text> {
          // ...

          public void configure(JobConf conf) {
            String inputName = conf.get("map.input.file");  // returns null
            // ...
          }
        }

        public int run(String[] arg0) throws Exception {
          JobConf job = new JobConf(Test.class);
          // ...
          MultipleInputs.addInputPath(job, new Path("A"), TextInputFormat.class);
          MultipleInputs.addInputPath(job, new Path("B"), SequenceFileInputFormat.class);
          // ...
        }
      }

      Attachments

      1. mr-1743.diff (2 kB, Liyin Liang)

        Activity

        Ted Yu added a comment -

        There is this check in updateJobWithSplit():
        if (inputSplit instanceof FileSplit) {

        Since TaggedInputSplit isn't an instance of FileSplit, "map.input.file" isn't set by updateJobWithSplit().

        According to the usage in DelegatingInputFormat.getSplits():
        splits.add(new TaggedInputSplit(pathSplit, conf, format.getClass(),
        mapperClass));
        One solution is to let TaggedInputSplit extend FileSplit.
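
        For context, a simplified sketch of the method Ted references, as it stood around 0.20 (the literal property keys are assumed for that release; branch-1 uses JobContext constants, as Jim's comment below shows):

        // Sketch of MapTask.updateJobWithSplit as of ~0.20 (simplified).
        // TaggedInputSplit is not a FileSplit, so for MultipleInputs jobs
        // the body is skipped and "map.input.file" is never set.
        private void updateJobWithSplit(final JobConf job, InputSplit inputSplit) {
          if (inputSplit instanceof FileSplit) {
            FileSplit fileSplit = (FileSplit) inputSplit;
            job.set("map.input.file", fileSplit.getPath().toString());
            job.setLong("map.input.start", fileSplit.getStart());
            job.setLong("map.input.length", fileSplit.getLength());
          }
        }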

        Tom White added a comment -

        > One solution is to let TaggedInputSplit extend FileSplit.

        Except that MultipleInputs works with any InputFormat, not just FileInputFormat. Another way of doing this would be to invert the logic so that the InputSplit updates properties on Configuration (this would need a new method on InputSplit).
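
        A rough sketch of that inversion (hypothetical API, not actual Hadoop code; the method name updateConfiguration is invented for illustration):

        // Hypothetical: each split type publishes its own properties, so
        // MapTask needs no instanceof checks.
        // FileSplit would set the file-related keys:
        public class FileSplit implements InputSplit {
          // ... existing fields and methods ...
          public void updateConfiguration(JobConf job) {  // invented method
            job.set("map.input.file", getPath().toString());
            job.setLong("map.input.start", getStart());
            job.setLong("map.input.length", getLength());
          }
        }

        // TaggedInputSplit would delegate to the split it wraps:
        class TaggedInputSplit implements InputSplit {
          private InputSplit inputSplit;
          // ... existing fields and methods ...
          public void updateConfiguration(JobConf job) {  // invented method
            inputSplit.updateConfiguration(job);
          }
        }

        // MapTask.updateJobWithSplit would then collapse to a single call:
        private void updateJobWithSplit(final JobConf job, InputSplit inputSplit) {
          inputSplit.updateConfiguration(job);
        }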

        Luo Yi added a comment -

        The following code can get the true file name from the TaggedInputSplit. Because TaggedInputSplit is a package-private Hadoop class, you have to put your own class in the org.apache.hadoop.mapred.lib package:

        TaggedInputSplitGetName.java
        InputSplit is = reporter.getInputSplit();
        String name = is.getClass().getName();
        if ( name.compareTo("org.apache.hadoop.mapred.FileSplit") == 0 ) {
            FileSplit fs = (FileSplit)is;
            String path = fs.getPath().toString();
            word.set(path);
            output.collect(word, one);
        }
        if ( name.compareTo("org.apache.hadoop.mapred.lib.TaggedInputSplit") == 0 ) {
            TaggedInputSplit tis = (TaggedInputSplit)is;
            InputSplit iis = tis.getInputSplit();
            String iname = iis.getClass().getName();
            word.set(iname);
            output.collect(word, one);
            if ( iname.compareTo("org.apache.hadoop.mapred.FileSplit") == 0 ) {
                FileSplit fs = (FileSplit)iis;
               // the path from the TaggedInputSplit should be prefixed by "convert: "
                String path = "convert: " + fs.getPath().toString();
                word.set(path);
                output.collect(word, one);
            }
        }
        
        and the output file gives me:
        
        {noformat}
        $ grep 'convert' testout/part-00000 |head -n 5
        convert: hdfs://myowndir/pt=20100513000000/attempt_201003291206_327196_r_000000_0    1
        convert: hdfs://myowndir/pt=20100513000000/attempt_201003291206_327196_r_000001_0    1
        convert: hdfs://myowndir/pt=20100513000000/attempt_201003291206_327196_r_000002_0    1
        convert: hdfs://myowndir/pt=20100513000000/attempt_201003291206_327196_r_000003_0    1
        convert: hdfs://myowndir/pt=20100513000000/attempt_201003291206_327196_r_000004_0    1
        {noformat} 
        
        You may give it a try.
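
        An alternative that avoids the package trick (a sketch, untested against this branch) is to reach the package-private class reflectively:

        // Reflection-based workaround (sketch; reflection exceptions are
        // propagated for brevity). Works from any package because
        // setAccessible(true) bypasses the package-private access check
        // on TaggedInputSplit.
        private static String getInputFileName(Reporter reporter) throws Exception {
          InputSplit is = reporter.getInputSplit();
          if (is.getClass().getName()
                .equals("org.apache.hadoop.mapred.lib.TaggedInputSplit")) {
            java.lang.reflect.Method m = is.getClass().getMethod("getInputSplit");
            m.setAccessible(true);  // the declaring class is package-private
            is = (InputSplit) m.invoke(is);
          }
          return (is instanceof FileSplit)
              ? ((FileSplit) is).getPath().toString()
              : null;
        }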
        
        
        Jim Donofrio added a comment -

        private void updateJobWithSplit(final JobConf job, InputSplit inputSplit) {
          if (inputSplit instanceof FileSplit) {
            FileSplit fileSplit = (FileSplit) inputSplit;
            job.set(JobContext.MAP_INPUT_FILE, fileSplit.getPath().toString());
            job.setLong(JobContext.MAP_INPUT_START, fileSplit.getStart());
            job.setLong(JobContext.MAP_INPUT_PATH, fileSplit.getLength());
          } else if (inputSplit instanceof TaggedInputSplit) {
            updateJobWithSplit(job, ((TaggedInputSplit) inputSplit).getInputSplit());
          }
        }

        If TaggedInputSplit is made a public class instead of package-private, then updateJobWithSplit can simply call itself again when the inputSplit is a TaggedInputSplit.
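
        With that change applied, a mapper in a MultipleInputs job would read the file name the usual way again, e.g.:

        public void configure(JobConf conf) {
          // After the fix, this is the wrapped FileSplit's path rather than null.
          String inputFile = conf.get("map.input.file");
        }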

        Liyin Liang added a comment -

        Jim's solution is nice.

        Haijia Zhou added a comment -

        Hi, I'm facing the same issue in my project. I just wonder how Luo's and Jim's solutions can work, since TaggedInputSplit is not a public class.
        Thanks

        Liyin Liang added a comment -

        Attached a patch based on branch-1.1 with Jim's solution. The patch works well in our production cluster.

        Hans Uhlig added a comment -

        Has this been applied upstream?


          People

          • Assignee: Liyin Liang
          • Reporter: Yuanyuan Tian
          • Votes: 3
          • Watchers: 8
