  1. Hadoop Common
  2. HADOOP-3341

Make key-value separators in Hadoop streaming fully configurable

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      By default, Hadoop streaming uses TAB as the separator everywhere. However, in some environments users may want to use custom separators (e.g., ^A = \u0001).

      The separator logic in hadoop streaming is very convoluted. Here is a brief summary:

      InputFormat {
      KeyValueLineRecordReader.java:59:
      S1: String sepStr = job.get("key.value.separator.in.input.line", "\t");
      }

      Mapper {
      PipeMapper.java:88:
      S2: clientOut_.write('\t');

      Call mapper process

      PipeMapRed.java:124:
      S3: String mapOutputFieldSeparator = job_.get("stream.map.output.field.separator", "\t");
      PipeMapRed.java:128:
      this.numOfMapOutputKeyFields = job_.getInt("stream.num.map.output.key.fields", 1);
      }

      Reducer {
      PipeReducer.java:78:
      S4: clientOut_.write('\t');

      Call reducer process

      PipeMapRed.java:125:
      S5: String reduceOutputFieldSeparator = job_.get("stream.reduce.output.field.separator", "\t");
      PipeMapRed.java:129:
      this.numOfReduceOutputKeyFields = job_.getInt("stream.num.reduce.output.key.fields", 1);
      }

      OutputFormat {
      TextOutputFormat.java:112:
      S6: String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t");
      }
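To illustrate how an output separator (S3 or S5) interacts with the matching "num ... key.fields" setting, here is a minimal sketch of the split. It assumes the key is everything before the N-th separator and the value is everything after it, and that a line with fewer than N separators becomes a key with an empty value. The class and method names are hypothetical and are not taken from PipeMapRed.java.

```java
public class KeyValueSplitSketch {

    /** Index of the n-th occurrence of sep in line, or -1 if there are fewer than n. */
    static int findNthSeparator(String line, String sep, int n) {
        int pos = -1;
        int from = 0;
        for (int i = 0; i < n; i++) {
            pos = line.indexOf(sep, from);
            if (pos < 0) {
                return -1;
            }
            from = pos + sep.length();
        }
        return pos;
    }

    /** Splits one output line into {key, value} at the numKeyFields-th separator. */
    static String[] split(String line, String sep, int numKeyFields) {
        int pos = findNthSeparator(line, sep, numKeyFields);
        if (pos < 0) {
            // Assumption: too few separators means the whole line is the key.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + sep.length()) };
    }

    public static void main(String[] args) {
        // Two key fields, ^A (\u0001) as the separator.
        String[] kv = split("a\u0001b\u0001c", "\u0001", 2);
        // Show ^A as '|' so the result is readable on a terminal.
        System.out.println(kv[0].replace('\u0001', '|') + " => " + kv[1]);
    }
}
```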

      Short-cuts:
      1. When "TextInputFormat" is used, S1 and S2 are not used at all. Lines are fed directly into the mapper through the value part of the key-value pair; the keys, which are byte offsets, are ignored.
      2. For jobs with no reducers, the "Reducer" step is skipped.

      S2 and S4 are the only separators that are hard-coded. We need to make them configurable, possibly under the following names for conformity:
      stream.map.input.field.separator
      stream.reduce.input.field.separator
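One way the hard-coded clientOut_.write('\t') calls could become configurable is sketched below. This is not the attached patch: a plain Map stands in for the job configuration, and the class and method names are hypothetical. The separator is looked up under the proposed property name, with TAB as the default so that existing jobs behave as before.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class InputSeparatorSketch {

    // Property name proposed in this issue for the mapper side.
    static final String MAP_INPUT_SEP_KEY = "stream.map.input.field.separator";

    /** Looks up a separator in the (stand-in) configuration, defaulting to TAB. */
    static byte[] separatorBytes(Map<String, String> conf, String key) {
        return conf.getOrDefault(key, "\t").getBytes(StandardCharsets.UTF_8);
    }

    /** Encodes one record as key + separator + value + newline, as sent to the child process. */
    static byte[] encodeRecord(byte[] sep, String key, String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        out.write(k, 0, k.length);
        out.write(sep, 0, sep.length); // replaces the hard-coded write('\t')
        out.write(v, 0, v.length);
        out.write('\n');
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(MAP_INPUT_SEP_KEY, "\u0001");
        byte[] record = encodeRecord(separatorBytes(conf, MAP_INPUT_SEP_KEY), "key", "value");
        // Show ^A as '|' so the record is readable on a terminal.
        System.out.print(new String(record, StandardCharsets.UTF_8).replace('\u0001', '|'));
    }
}
```

Computing the separator bytes once per task, rather than once per record, keeps the per-record cost the same as the original single-byte write.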

      Then, by specifying the following options, we will be able to use ^A instead of TAB everywhere:
      -jobconf key.value.separator.in.input.line=^A
      -jobconf stream.map.input.field.separator=^A
      -jobconf stream.map.output.field.separator=^A
      -jobconf stream.reduce.input.field.separator=^A
      -jobconf stream.reduce.output.field.separator=^A
      -jobconf mapred.textoutputformat.separator=^A

      Maybe hadoop streaming can also provide a single option to override these 6 options.
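A single overriding option could be sketched roughly as follows. The option name "stream.all.field.separator" is purely hypothetical and does not exist in Hadoop. A plain Map again stands in for the job configuration. When the hypothetical option is set, its value is copied into all six separator properties.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SingleSeparatorOptionSketch {

    // Hypothetical option name, used only for this illustration.
    static final String ALL_SEPARATORS_KEY = "stream.all.field.separator";

    // The six separator properties discussed in this issue.
    static final List<String> SEPARATOR_KEYS = Arrays.asList(
        "key.value.separator.in.input.line",
        "stream.map.input.field.separator",
        "stream.map.output.field.separator",
        "stream.reduce.input.field.separator",
        "stream.reduce.output.field.separator",
        "mapred.textoutputformat.separator");

    /** Returns a copy of conf in which the single option, if present, overrides all six keys. */
    static Map<String, String> expand(Map<String, String> conf) {
        Map<String, String> result = new HashMap<>(conf);
        String all = conf.get(ALL_SEPARATORS_KEY);
        if (all != null) {
            for (String key : SEPARATOR_KEYS) {
                result.put(key, all);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(ALL_SEPARATORS_KEY, "\u0001");
        Map<String, String> expanded = expand(conf);
        for (String key : SEPARATOR_KEYS) {
            // Show ^A as "^A" so the output is readable on a terminal.
            System.out.println(key + "=" + expanded.get(key).replace("\u0001", "^A"));
        }
    }
}
```

An alternative design would let an explicitly set individual property take precedence over the single option; which behaviour is preferable is left open here.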

        Attachments

        1. 3341-5.patch
          22 kB
          Zheng Shao
        2. 3341-4.patch
          15 kB
          Zheng Shao
        3. 3341-3.patch
          10 kB
          Zheng Shao
        4. 3341-2.patch
          6 kB
          Zheng Shao
        5. 3341-1.patch
          3 kB
          Zheng Shao

              People

              • Assignee: Zheng Shao (zshao)
              • Reporter: Zheng Shao (zshao)
              • Votes: 0
              • Watchers: 3
