Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Hadoop Flags: Reviewed
Description
By default, hadoop streaming uses TAB as the separator in all places. However, in some environments users may want to use customized separators (e.g., ^A = \u0001).
The separator logic in hadoop streaming is very convoluted. Here is a brief summary:
InputFormat {
KeyValueLineRecordReader.java:59:
S1: String sepStr = job.get("key.value.separator.in.input.line", "\t");
}
Mapper {
PipeMapper.java:88:
S2: clientOut_.write('\t');
Call mapper process
PipeMapRed.java:124:
S3: String mapOutputFieldSeparator = job_.get("stream.map.output.field.separator", "\t");
PipeMapRed.java:128:
this.numOfMapOutputKeyFields = job_.getInt("stream.num.map.output.key.fields", 1);
}
Reducer {
PipeReducer.java:78:
S4: clientOut_.write('\t');
Call reducer process
PipeMapRed.java:125:
S5: String reduceOutputFieldSeparator = job_.get("stream.reduce.output.field.separator", "\t");
PipeMapRed.java:129:
this.numOfReduceOutputKeyFields = job_.getInt("stream.num.reduce.output.key.fields", 1);
}
OutputFormat {
TextOutputFormat.java:112:
S6: String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t");
}
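To illustrate how S3/S5 interact with stream.num.map.output.key.fields and stream.num.reduce.output.key.fields, here is a minimal sketch of splitting one output line into key and value at the N-th separator. It is not the PipeMapRed code; the class and method names are hypothetical, and treating a line with too few separators as "whole line is the key, empty value" is an assumption.

```java
// Hypothetical helper, not part of hadoop streaming.
final class KeyValueSplitSketch {
    /**
     * Splits a line into {key, value}. The key is the first numKeyFields
     * fields; the value is everything after the numKeyFields-th separator.
     */
    static String[] splitKeyValue(String line, String separator, int numKeyFields) {
        if (separator.isEmpty() || numKeyFields < 1) {
            throw new IllegalArgumentException(
                "need a non-empty separator and numKeyFields >= 1");
        }
        int from = 0;  // where the next search starts
        int pos = -1;  // index of the most recently found separator
        for (int i = 0; i < numKeyFields; i++) {
            pos = line.indexOf(separator, from);
            if (pos < 0) {
                // Assumption: too few separators means the whole line is the key.
                return new String[] { line, "" };
            }
            from = pos + separator.length();
        }
        return new String[] { line.substring(0, pos), line.substring(from) };
    }
}
```

With separator ^A and one key field, the line a^Ab^Ac splits into key a and value b^Ac; with two key fields, into key a^Ab and value c.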
Short-cuts:
1. If "TextInputFormat" is used, S1 and S2 are not used at all. Lines are fed directly into the mapper through the value part of the key-value pair; the keys, which are byte offsets, are ignored.
2. For jobs with no reducers, the "Reducer" step is skipped.
We need to make S2 and S4 (the two hard-coded TAB writes) configurable, possibly under the following names for consistency:
stream.map.input.field.separator
stream.reduce.input.field.separator
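A minimal sketch of what replacing the hard-coded clientOut_.write('\t') could look like, assuming the property names proposed above. java.util.Properties stands in for the job configuration here, and the class and method names are hypothetical; the real change would read from the JobConf and write to the child process stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper, not part of hadoop streaming.
final class InputSeparatorSketch {
    /**
     * Builds the bytes for one record sent to the mapper or reducer process.
     * side is "map" or "reduce"; the separator defaults to TAB, which matches
     * the current hard-coded behaviour.
     */
    static byte[] encodeRecord(Properties conf, String side, String key, String value) {
        String sep = conf.getProperty("stream." + side + ".input.field.separator", "\t");
        return (key + sep + value + "\n").getBytes(StandardCharsets.UTF_8);
    }
}
```

Jobs that set neither property keep getting TAB, so existing streaming scripts would be unaffected.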
Then, by specifying -jobconf key.value.separator.in.input.line=^A -jobconf stream.map.input.field.separator=^A -jobconf stream.map.output.field.separator=^A -jobconf stream.reduce.input.field.separator=^A -jobconf stream.reduce.output.field.separator=^A -jobconf mapred.textoutputformat.separator=^A, we will be able to use ^A instead of TAB everywhere!
Maybe hadoop streaming can also provide a single option to override these 6 options.
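One way such a single option could work is to expand it into the six individual properties before the job is configured. The sketch below is only an illustration: the option name stream.all.field.separator is hypothetical (it does not exist in hadoop streaming), java.util.Properties stands in for the job configuration, and two of the six keys are the names proposed in this issue.

```java
import java.util.Properties;

// Hypothetical helper, not part of hadoop streaming.
final class SeparatorOverrideSketch {
    // Hypothetical name for the single catch-all option.
    static final String SINGLE_OPTION = "stream.all.field.separator";

    // The six separator properties discussed in this issue (S1 to S6).
    static final String[] SEPARATOR_KEYS = {
        "key.value.separator.in.input.line",
        "stream.map.input.field.separator",
        "stream.map.output.field.separator",
        "stream.reduce.input.field.separator",
        "stream.reduce.output.field.separator",
        "mapred.textoutputformat.separator",
    };

    /** If the single option is set, copies its value into all six properties. */
    static void expandSingleOption(Properties conf) {
        String sep = conf.getProperty(SINGLE_OPTION);
        if (sep == null) {
            return; // nothing to override; the individual settings stay as they are
        }
        for (String key : SEPARATOR_KEYS) {
            conf.setProperty(key, sep);
        }
    }
}
```

In this sketch the single option wins over any individually set value; letting explicit per-stage settings take precedence instead would be an equally reasonable design.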
Attachments
Issue Links
- duplicates MAPREDUCE-598 Streaming: better conrol over input splits (Resolved)