Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Description
Right now, the protocol between the streaming mapper/reducer and the framework is very inflexible.
The mapper/reducer generates line-oriented output. The framework picks it up line by line and splits
each line into a key/value pair. By default, the substring up to the first tab char is the key, and the
substring after the first tab char is the value.
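A minimal sketch of the default split described above, written in Python purely for illustration. The function name `split_default` is hypothetical, and the handling of a line that contains no tab is an assumption, not something stated here:

```python
def split_default(line):
    """Split one output line at the first tab char into (key, value)."""
    # Assumption: a line with no tab becomes the key, with an empty value.
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value
```

Only the first tab matters: any later tabs stay inside the value.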
However, in many cases, the application wants some control over how the pair is split.
Here, I'd like to introduce the following configuration variables for that:
1. "streaming.output.field.separator": the field separator. The default is the tab char,
but the user can specify a different one (e.g. ':' or ', ').
A map output line is then treated as a list of fields separated by this separator.
2. "streaming.num.fields.for.mapout.key": the number of leading fields used as the map output key
(and for sorting on the reduce side).
The default value is 1.
The rest of the fields are used as the value. For example, I can specify the first 5 fields as my mapout key.
3. "streaming.num.fields.for.partitioning": Sometimes, I want to use fewer fields for partitioning to
achieve the "primary/secondary" composite key effect as proposed in HADOOP-485. The default value is 1.
For example, I can set "streaming.num.fields.for.partitioning" to 3
and "streaming.num.fields.for.mapout.key" to 5.
This effectively says that fields 1 to 3 are my primary key and fields 4 and 5 are my secondary key.
With the above default values, the change is compatible with the current behavior
while introducing a desirable new feature in a clean way.
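To make the proposal concrete, here is a rough sketch of how the three settings could interact, written in Python purely for illustration. The names `split_line` and `partition_for` are hypothetical, the CRC32-based partitioning is an assumption chosen only to get a deterministic hash, and none of this is the framework's actual code:

```python
import zlib


def split_line(line, separator="\t", num_key_fields=1, num_partition_fields=1):
    """Return (key, value, partition_key) for one map output line.

    The three keyword arguments stand in for the proposed settings
    streaming.output.field.separator, streaming.num.fields.for.mapout.key
    and streaming.num.fields.for.partitioning.
    """
    fields = line.rstrip("\n").split(separator)
    # The first num_key_fields fields form the key; the rest form the value.
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    # Partitioning looks only at a (usually shorter) prefix of the key.
    partition_key = separator.join(fields[:num_partition_fields])
    return key, value, partition_key


def partition_for(partition_key, num_reducers):
    """Pick a reducer from the partition key (assumed hash: CRC32)."""
    return zlib.crc32(partition_key.encode("utf-8")) % num_reducers


# Fields 1-3 choose the reducer; fields 4 and 5 only affect the sort order.
key, value, pkey = split_line("a:b:c:d:e:f:g", ":", 5, 3)
```

With a 5-field key and 3-field partitioning, the line `a:b:c:d:e:f:g` yields the key `a:b:c:d:e`, the value `f:g`, and the partition key `a:b:c`. Lines that agree on the first three fields therefore go to the same reducer, where they are sorted on all five key fields. With every argument left at its default, the split is the same as the current first-tab behavior.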
Thoughts?