Hadoop Common / HADOOP-1722

Make streaming handle non-UTF-8 byte arrays

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.2, 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Streaming allows binary (or other non-UTF8) streams.

      Description

      Right now, the streaming framework expects the output of the streaming process (mapper or reducer) to be
      line-oriented UTF-8 text. This limit makes it impossible to use programs whose output may be non-UTF-8
      (international encodings, or even binary data). Streaming can overcome this limit by introducing a simple
      encoding protocol. For example, it can allow the mapper/reducer to hex-encode its keys/values, and
      the framework decodes them on the Java side.
      This way, as long as the mapper/reducer executables follow this encoding protocol,
      they can output arbitrary byte arrays and the streaming framework can handle them.
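
      The hex-encoding idea proposed above can be sketched roughly as follows. This is only an
      illustration of the proposal in this description, not necessarily the format the committed patch
      uses. The function names encode_record and decode_record are hypothetical, and the tab-separated,
      newline-terminated line layout is an assumption. The decoder is shown in Python only to keep the
      sketch self-contained; in the proposal, decoding would happen on the Java side.

```python
import binascii
from typing import Tuple


def encode_record(key: bytes, value: bytes) -> bytes:
    """Hex-encode a key/value pair as one tab-separated, newline-terminated line.

    Hex digits never contain a tab or newline, so the line stays parseable
    even when the raw key or value contains those bytes.
    """
    return binascii.hexlify(key) + b"\t" + binascii.hexlify(value) + b"\n"


def decode_record(line: bytes) -> Tuple[bytes, bytes]:
    """Reverse encode_record: split on the first tab and hex-decode both fields."""
    hex_key, hex_value = line.rstrip(b"\n").split(b"\t", 1)
    return binascii.unhexlify(hex_key), binascii.unhexlify(hex_value)


if __name__ == "__main__":
    # A key and value that would break a line-oriented UTF-8 text protocol.
    raw_key = b"\xff\xfe\x00key"
    raw_value = b"line1\nline2\ttabbed"
    encoded = encode_record(raw_key, raw_value)
    assert decode_record(encoded) == (raw_key, raw_value)
```

      The trade-off of such a scheme is that every byte becomes two hex characters, so the data
      exchanged with the streaming process doubles in size.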

      1. HADOOP-1722.patch
        114 kB
        Klaas Bosteels
      2. HADOOP-1722-branch-0.18.patch
        152 kB
        Klaas Bosteels
      3. HADOOP-1722-branch-0.19.patch
        152 kB
        Klaas Bosteels
      4. HADOOP-1722-v0.20.1.patch
        153 kB
        Matthias Lehmann
      5. HADOOP-1722-v2.patch
        119 kB
        Klaas Bosteels
      6. HADOOP-1722-v3.patch
        119 kB
        Klaas Bosteels
      7. HADOOP-1722-v4.patch
        121 kB
        Klaas Bosteels
      8. HADOOP-1722-v4.patch
        119 kB
        Klaas Bosteels
      9. HADOOP-1722-v5.patch
        153 kB
        Klaas Bosteels
      10. HADOOP-1722-v6.patch
        153 kB
        Klaas Bosteels

          Activity

          Jay Hacker made changes -
          Link This issue relates to MAPREDUCE-5018 [ MAPREDUCE-5018 ]
          Arun C Murthy made changes -
          Fix Version/s 1.0.2 [ 12320152 ]
          Tom White made changes -
          Status Resolved [ 5 ] -> Closed [ 6 ]
          Rick Weber made changes -
          Link This issue is related to HADOOP-6901 [ HADOOP-6901 ]
          Amareshwari Sriramadasu made changes -
          Link This issue relates to MAPREDUCE-606 [ MAPREDUCE-606 ]
          Zheng Shao made changes -
          Link This issue relates to HIVE-708 [ HIVE-708 ]
          Matthias Lehmann made changes -
          Attachment HADOOP-1722-v0.20.1.patch [ 12422200 ]
          Robert Chansler made changes -
          Release Note "binary communication formats added to Streaming" -> "Streaming allows binary (or other non-UTF8) streams."
          Description (the original and new values both repeat the text of the Description section)
          Owen O'Malley made changes -
          Component/s contrib/streaming [ 12310972 ]
          weimin zhu made changes -
          Comment [ An error occurred when using the -D option; the error message follows:

          [hadoop@super03 hadoop-latest]$ hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar -input data -output result -mapper "wc -c" -numReduceTasks 0 -D stream.map.input=rawbytes
          09/03/06 13:18:31 ERROR streaming.StreamJob: Unexpected -D while processing -input|-output|-mapper|-combiner|-reducer|-file|-dfs|-jt|-additionalconfspec|-inputformat|-outputformat|-partitioner|-numReduceTasks|-inputreader|-mapdebug|-reducedebug|||-cacheFile|-cacheArchive|-io|-verbose|-info|-debug|-inputtagged|-help
          Usage: $HADOOP_HOME/bin/hadoop jar \
                    $HADOOP_HOME/hadoop-streaming.jar [options]
          Options:
            -input <path> DFS input file(s) for the Map step
            -output <path> DFS output directory for the Reduce step
            -mapper <cmd|JavaClassName> The streaming command to run
            -combiner <JavaClassName> Combiner has to be a Java class
            -reducer <cmd|JavaClassName> The streaming command to run
            -file <file> File/dir to be shipped in the Job jar file
            -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
            -outputformat TextOutputFormat(default)|JavaClassName Optional.
            -partitioner JavaClassName Optional.
            -numReduceTasks <num> Optional.
            -inputreader <spec> Optional.
            -cmdenv <n>=<v> Optional. Pass env.var to streaming commands
            -mapdebug <path> Optional. To run this script when a map task fails
            -reducedebug <path> Optional. To run this script when a reduce task fails
            -io <identifier> Optional.
            -verbose

          Generic options supported are
          -conf <configuration file> specify an application configuration file
          -D <property=value> use value for given property
          -fs <local|namenode:port> specify a namenode
          -jt <local|jobtracker:port> specify a job tracker
          -files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
          -libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
          -archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.

          The general command line syntax is
          bin/hadoop command [genericOptions] [commandOptions]


          For more details about these options:
          Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info

          Streaming Command Failed! ]
          Klaas Bosteels made changes -
          Attachment HADOOP-1722-branch-0.19.patch [ 12401426 ]
          Klaas Bosteels made changes -
          Attachment HADOOP-1722-branch-0.18.patch [ 12400166 ]
          Devaraj Das made changes -
          Fix Version/s 0.21.0 [ 12313563 ]
          Resolution Fixed [ 1 ]
          Status Patch Available [ 10002 ] -> Resolved [ 5 ]
          Klaas Bosteels made changes -
          Hadoop Flags [Reviewed]
          Status In Progress [ 3 ] -> Patch Available [ 10002 ]
          Klaas Bosteels made changes -
          Status Patch Available [ 10002 ] -> In Progress [ 3 ]
          Klaas Bosteels made changes -
          Attachment HADOOP-1722-v6.patch [ 12400123 ]
          Klaas Bosteels made changes -
          Hadoop Flags [Reviewed]
          Status In Progress [ 3 ] -> Patch Available [ 10002 ]
          Release Note "binary communication format added to Streaming" -> "binary communication formats added to Streaming"
          Klaas Bosteels made changes -
          Attachment HADOOP-1722-v5.patch [ 12399650 ]
          Klaas Bosteels made changes -
          Status Patch Available [ 10002 ] -> In Progress [ 3 ]
          Klaas Bosteels made changes -
          Status In Progress [ 3 ] -> Patch Available [ 10002 ]
          Klaas Bosteels made changes -
          Attachment HADOOP-1722-v4.patch [ 12399457 ]
          Klaas Bosteels made changes -
          Status Patch Available [ 10002 ] -> In Progress [ 3 ]
          Klaas Bosteels made changes -
          Status In Progress [ 3 ] -> Patch Available [ 10002 ]
          Klaas Bosteels made changes -
          Status Patch Available [ 10002 ] -> In Progress [ 3 ]
          Klaas Bosteels made changes -
          Attachment HADOOP-1722-v4.patch [ 12398937 ]
          Klaas Bosteels made changes -
          Status Open [ 1 ] -> Patch Available [ 10002 ]
          Release Note binary communication format added to Streaming
          Hadoop Flags [Reviewed]
          Klaas Bosteels made changes -
          Assignee Christopher Zimmerman [ zim ] -> Klaas Bosteels [ klbostee ]
          Klaas Bosteels made changes -
          Attachment HADOOP-1722-v3.patch [ 12398901 ]
          Klaas Bosteels made changes -
          Attachment HADOOP-1722-v2.patch [ 12398826 ]
          Klaas Bosteels made changes -
          Attachment HADOOP-1722.patch [ 12398744 ]
          FM FLOCH made changes -
          Field Original Value New Value
          Assignee Christopher Zimmerman [ zim ]
          Runping Qi created issue -

            People

            • Assignee:
              Klaas Bosteels
            • Reporter:
              Runping Qi
            • Votes:
              0
            • Watchers:
              14