[HADOOP-3341] make key-value separators in hadoop streaming fully configurable - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.19.0
Component/s: None
Labels: None
Hadoop Flags: Reviewed

Description

By default, hadoop streaming uses TAB as the separator in all places. However, in some environments users may want to use customized separators (e.g., ^A = \u0001).

The separator logic in hadoop streaming is very convoluted. Here is a brief summary:

InputFormat {
KeyValueLineRecordReader.java:59:
S1: String sepStr = job.get("key.value.separator.in.input.line", "\t");
}

Mapper {
PipeMapper.java:88:
S2: clientOut_.write('\t');

Call mapper process

PipeMapRed.java:124:
S3: String mapOutputFieldSeparator = job_.get("stream.map.output.field.separator", "\t");
PipeMapRed.java:128:
this.numOfMapOutputKeyFields = job_.getInt("stream.num.map.output.key.fields", 1);
}

Reducer {
PipeReducer.java:78:
S4: clientOut_.write('\t');

Call reducer process

PipeMapRed.java:125:
S5: String reduceOutputFieldSeparator = job_.get("stream.reduce.output.field.separator", "\t");
PipeMapRed.java:129:
this.numOfReduceOutputKeyFields = job_.getInt("stream.num.reduce.output.key.fields", 1);
}

OutputFormat {
TextOutputFormat.java:112:
S6: String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t");
}
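
To make the role of S3 and S5 (together with the two num.*.key.fields settings) concrete, here is a minimal sketch of splitting one line of process output into key and value at the n-th occurrence of a separator. The class and method names (`SeparatorSketch`, `splitKeyValue`) are hypothetical and this is not the code in PipeMapRed.java. The handling of a line with fewer than n separators (the whole line becomes the key, the value is empty) is an assumption of this sketch.

```java
class SeparatorSketch {
    /**
     * Splits a line into {key, value} at the n-th occurrence of separator.
     * Assumption of this sketch: if the line has fewer than numKeyFields
     * separators, the whole line is the key and the value is empty.
     */
    public static String[] splitKeyValue(String line, String separator, int numKeyFields) {
        if (numKeyFields < 1) {
            throw new IllegalArgumentException("numKeyFields must be >= 1");
        }
        int pos = -1;   // index of the most recently found separator
        int from = 0;   // where the next search starts
        for (int i = 0; i < numKeyFields; i++) {
            pos = line.indexOf(separator, from);
            if (pos < 0) {
                return new String[] { line, "" };
            }
            from = pos + separator.length();
        }
        return new String[] { line.substring(0, pos), line.substring(from) };
    }

    public static void main(String[] args) {
        String sep = "\u0001"; // ^A
        String[] kv = splitKeyValue("k1" + sep + "k2" + sep + "v", sep, 2);
        // Show ^A in place of the control character so the output is readable.
        System.out.println(kv[0].replace(sep, "^A") + " | " + kv[1]);
    }
}
```

With two key fields, the first two fields and the separator between them form the key, and everything after the second separator is the value.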

Short-cuts:
1. If we use "TextInputFormat", S1 and S2 are not used at all. Lines are fed directly into the mapper (through the value part of the key-value pair; the keys, which are offsets, are simply ignored).
2. For jobs with no reducers, the "Reducer" step is skipped.

We need to make S2 and S4 (the two hard-coded '\t' writes) configurable, possibly under the following names for conformity:
stream.map.input.field.separator
stream.reduce.input.field.separator

Then, by specifying: -jobconf key.value.separator.in.input.line=^A -jobconf stream.map.input.field.separator=^A -jobconf stream.map.output.field.separator=^A -jobconf stream.reduce.input.field.separator=^A -jobconf stream.reduce.output.field.separator=^A -jobconf mapred.textoutputformat.separator=^A, we will be able to use ^A instead of TAB in every place!

Maybe hadoop streaming can also provide a single option to override these 6 options.
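
As a rough illustration of what such a single option could expand to, the sketch below builds the six -jobconf arguments from one separator value. The class and method names (`SeparatorArgs`, `jobconfArgs`) are hypothetical and not part of hadoop streaming. The property names are taken from S1, S3, S5 and S6 in the summary, plus the two input-separator names proposed in this issue.

```java
import java.util.ArrayList;
import java.util.List;

class SeparatorArgs {
    // Property names as written in this issue. The two *.input.* names
    // are the ones proposed here, not pre-existing settings.
    static final String[] SEPARATOR_KEYS = {
        "key.value.separator.in.input.line",
        "stream.map.input.field.separator",
        "stream.map.output.field.separator",
        "stream.reduce.input.field.separator",
        "stream.reduce.output.field.separator",
        "mapred.textoutputformat.separator"
    };

    /** Returns "-jobconf", "name=separator" pairs for all six properties. */
    public static List<String> jobconfArgs(String separator) {
        List<String> args = new ArrayList<>();
        for (String key : SEPARATOR_KEYS) {
            args.add("-jobconf");
            args.add(key + "=" + separator);
        }
        return args;
    }

    public static void main(String[] argv) {
        // "^A" is a placeholder here; on a real command line the actual
        // control character would have to be passed.
        System.out.println(String.join(" ", jobconfArgs("^A")));
    }
}
```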

Attachments

3341-5.patch, 26/Jun/08 19:51, 22 kB, Zheng Shao
3341-4.patch, 19/May/08 22:35, 15 kB, Zheng Shao
3341-3.patch, 09/May/08 22:09, 10 kB, Zheng Shao
3341-2.patch, 09/May/08 19:33, 6 kB, Zheng Shao
3341-1.patch, 06/May/08 02:26, 3 kB, Zheng Shao

Issue Links

duplicates

MAPREDUCE-598 Streaming: better control over input splits

Resolved

People

Assignee: Zheng Shao

Reporter: Zheng Shao

Votes: 0

Watchers: 3

Dates

Created: 02/May/08 20:00

Updated: 17/Jul/14 18:47

Resolved: 30/Jun/08 16:24