Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels:
      None

      Description

      This is a patch that adds a src/contrib/hadoopStreaming directory to the source tree.
      hadoopStreaming is a bridge to run non-Java code as Map/Reduce tasks.
      The unit test TestStreaming runs the Unix tools tr (as the Map) and uniq (as the Reduce).
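      For illustration, a streaming job in the spirit of that test could be launched as follows (the input/output paths and tr arguments here are hypothetical; the flags are the ones documented in the usage message below):

      bin/hadoopStreaming \
        -input /logs/in.txt \
        -output /logs/out \
        -mapper "tr a-z A-Z" \
        -reducer uniq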

      To test the patch:
      Merge the patch.
      The only existing file that is modified is trunk/build.xml
      trunk>ant deploy-contrib
      trunk>bin/hadoopStreaming : should show usage message
      trunk>ant test-contrib : should run one test successfully

      To add src/contrib/someOtherProject:
      edit src/contrib/build.xml
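
      A minimal sketch of what that edit might look like, assuming src/contrib/build.xml dispatches each contrib target into every subproject directory (the target names and layout are assumptions modeled on deploy-contrib/test-contrib; the actual file may differ):

      <!-- hypothetical: wire someOtherProject into the contrib "deploy" pass -->
      <target name="deploy">
        <ant dir="streaming" target="deploy"/>
        <ant dir="someOtherProject" target="deploy"/>
      </target>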

      1. streaming.patch
        99 kB
        Michel Tourn
      2. streaming.2.patch
        95 kB
        Michel Tourn
      3. streaming.3.patch
        33 kB
        Michel Tourn

        Activity

        Michel Tourn added a comment -

        The usage message:

        hadoop-trunk>bin/hadoopStreaming

        Usage: hadoopStreaming [options]
        Options:
        -input <path> DFS input file(s) for the Map step
        -output <path> DFS output directory for the Reduce step
        -mapper <cmd> The streaming command to run
        -reducer <cmd> The streaming command to run
        -files <file> Additional files to be shipped in the Job jar file
        -cluster <name> Default uses hadoop-default.xml and hadoop-site.xml
        -config <file> Optional. One or more paths to xml config files
        -inputreader <spec> Optional. See below
        -verbose

        In -input: globbing on <path> is supported and can have multiple -input
        Default Map input format: a line is a record in UTF-8
        the key part ends at first TAB, the rest of the line is the value
        Custom Map input format: -inputreader package.MyRecordReader,n=v,n=v
        comma-separated name-values can be specified to configure the InputFormat
        Ex: -inputreader 'StreamXmlRecordReader,begin=<doc>,end=</doc>'
        Map output format, reduce input/output format:
        Format defined by what mapper command outputs. Line-oriented
        Mapper and Reducer <cmd> syntax:
        If the mapper or reducer programs are prefixed with noship: then
        the paths are assumed to be valid absolute paths on the task tracker machines
        and are NOT packaged with the Job jar file.
        Use -cluster <name> to switch between "local" Hadoop and one or more remote
        Hadoop clusters.
        The default is to use the normal hadoop-default.xml and hadoop-site.xml
        Else configuration will use $HADOOP_HOME/conf/hadoop-<name>.xml

        Example: hadoopStreaming -mapper "noship:/usr/local/bin/perl5 filter.pl"
        -files /local/filter.pl -input "/logs/0604*/*" [...]
        Ships a script, invokes the non-shipped perl interpreter
        Shipped files go to the working directory so filter.pl is found by perl
        Input files are all the daily logs for days in month 2006-04

        Doug Cutting added a comment -

        Most of the changes to the top-level build.xml don't seem to be required, and a number are spurious whitespace and comment changes. It seems to build fine with only the new targets added.

        Also, is the new bin/ script required? Won't 'bin/hadoop jar build/hadoop-streaming.jar ...' suffice? (You'll need to set the "Main-Class" attribute in the jar's manifest.)
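
        For reference, a rough Ant sketch of setting that attribute when building the jar (the entry-point class name here is an assumption, not taken from the patch):

        <jar jarfile="${build.dir}/hadoop-streaming.jar" basedir="${build.classes}">
          <manifest>
            <!-- hypothetical main class; use whatever class holds streaming's main() -->
            <attribute name="Main-Class" value="org.apache.hadoop.streaming.StreamJob"/>
          </manifest>
        </jar>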

        Michel Tourn added a comment -

        >top-level build.xml :
        OK, I can remove the unnecessary changes.
        Which contrib targets would you keep in?
        I modeled deploy-contrib, test-contrib, and clean-contrib on the Nutch plugins.
        (It is true that for now the new targets are not required since the nightly target does not call them.)

        >bin/hadoop jar build/hadoop-streaming.jar ...
        Looks cleaner. I'll try to do it this way.

        Doug Cutting added a comment -

        I'm okay with all the contrib targets, but all of the other changes to that file seem spurious: the new properties are unused, and the directory it creates would be created by another build script anyway.

        Michel Tourn added a comment -

        Updated patch:

        1. The top-level build.xml now has the 3 contrib targets and no other changes.

        2. The hadoopStreaming script is gone. The new usage (see the example below) is:
        bin/hadoop jar build/hadoop-streaming.jar [options]

        3. Removed some spurious exec permissions on source files.
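
        For example (hypothetical paths; the flags are those from the usage message above):

        bin/hadoop jar build/hadoop-streaming.jar \
          -input "/logs/0604*/*" \
          -output /logs/out \
          -mapper /bin/cat \
          -reducer /usr/bin/uniq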

        Doug Cutting added a comment -

        I just committed this. It looks great! Thanks, Michel.

        Michel Tourn added a comment -

        An update to hadoop-streaming.

        Michel Tourn added a comment -

        This patch depends on the LargeUTF8 patch: http://issues.apache.org/jira/browse/HADOOP-136

        Added a few more configurable options.

        michel@cdev2004> bin/hadoop jar build/hadoop-streaming.jar -info
        Usage: $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar [options]
        Options:
        -input <path> DFS input file(s) for the Map step
        -output <path> DFS output directory for the Reduce step
        -mapper <cmd> The streaming command to run
        -combiner <cmd> Not implemented. But you can pipe the mapper output
        -reducer <cmd> The streaming command to run
        -file <file> File/dir to be shipped in the Job jar file
        -cluster <name> Default uses hadoop-default.xml and hadoop-site.xml
        -config <file> Optional. One or more paths to xml config files
        -dfs <h> Optional. Override DFS configuration
        -jt <h> Optional. Override JobTracker configuration
        -inputreader <spec> Optional.
        -jobconf <n>=<v> Optional.
        -cmdenv <n>=<v> Optional. Pass env.var to streaming commands
        -verbose

        For more details about these options:
        Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info
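
        For illustration, an invocation exercising the new options might look like this (all paths, names, and values here are hypothetical; the noship: prefix and flags are those documented above):

        bin/hadoop jar build/hadoop-streaming.jar \
          -input "/logs/0604*/*" \
          -output /logs/out \
          -mapper "noship:/usr/local/bin/perl5 filter.pl" \
          -file /local/filter.pl \
          -reducer uniq \
          -jobconf mapred.reduce.tasks=2 \
          -cmdenv PERL5LIB=/local/lib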


          People

          • Assignee:
            Doug Cutting
            Reporter:
            Michel Tourn
          • Votes:
            0
            Watchers:
            0
