Issue Details (XML | Word | Printable)

Key: HADOOP-3341
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Zheng Shao
Reporter: Zheng Shao
Votes: 0
Watchers: 3
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

make key-value separators in hadoop streaming fully configurable

Created: 02/May/08 08:00 PM   Updated: 08/Jul/09 05:05 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: 0.19.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 3341-1.patch 2008-05-06 02:26 AM Zheng Shao 3 kB
Text File Licensed for inclusion in ASF works 3341-2.patch 2008-05-09 07:33 PM Zheng Shao 6 kB
Text File Licensed for inclusion in ASF works 3341-3.patch 2008-05-09 10:09 PM Zheng Shao 10 kB
Text File Licensed for inclusion in ASF works 3341-4.patch 2008-05-19 10:35 PM Zheng Shao 15 kB
Text File Licensed for inclusion in ASF works 3341-5.patch 2008-06-26 07:51 PM Zheng Shao 22 kB

Hadoop Flags: Reviewed
Resolution Date: 30/Jun/08 04:24 PM


 Description  « Hide
By default, hadoop streaming uses TAB as the separator in all places. However in some environments, user may want to use customized separators (e.g, ^A = \u0001).

The separator logic in hadoop streaming is very convoluted. Here is a brief summary:

InputFormat {
KeyValueLineRecordReader.java:59:
S1: String sepStr = job.get("key.value.separator.in.input.line", "\t");
}

Mapper {
PipeMapper.java:88:
S2: clientOut_.write('\t');

Call mapper process

PipeMapRed.java:124:
S3: String mapOutputFieldSeparator = job_.get("stream.map.output.field.separator", "\t");
PipeMapRed.java:128:
this.numOfMapOutputKeyFields = job_.getInt("stream.num.map.output.key.fields", 1);
}

Reducer {
PipeReducer.java:78:
S4: clientOut_.write('\t');

Call reducer process

PipeMapRed.java:125:
S5: String reduceOutputFieldSeparator = job_.get("stream.reduce.output.field.separator", "\t");
PipeMapRed.java:129:
this.numOfReduceOutputKeyFields = job_.getInt("stream.num.reduce.output.key.fields", 1);
}

OutputFormat {
TextOuputFormat.java:112:
S6: String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t");
}

Short-cuts:
1. In case we use "TextInputFormat", S1 and S2 are not used at all. Lines are directly feed into the mapper (through the value part of the key-value pair - keys, which are offsets, are directly ignored).
2. For jobs with no reducers, The "Reducer" step is skipped.

We need to make S3 and S4 configurable, possibly under the following names for conformity:
stream.map.input.field.separator
stream.reduce.input.field.separator

Then, by specifying: -jobconf key.value.separator.in.input.line=^A -jobconf stream.map.input.field.separator=^A -jobconf stream.map.output.field.separator=^A -jobconf stream.reducer.input.field.separator=^A -jobconf stream.reducer.output.field.separator=^A -jobconf mapred.textoutputformat.separator=^A, we will be able to use ^A instead of TAB in every place!

Maybe hadoop streaming can also provide a single option to override these 6 options.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Hadoop QA added a comment - 06/May/08 05:12 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381472/3341-1.patch
against trunk revision 653638.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2404/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2404/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2404/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2404/console

This message is automatically generated.


Owen O'Malley added a comment - 07/May/08 10:52 PM
This looks good, except that the data fields should be down in PipeMapper and PipeReducer, respectively. They should also be made private. You can configure them in the PipeMapper and PipeReducer configure methods. Please also include a test for the change.

Zheng Shao added a comment - 07/May/08 10:57 PM
I agree it's cleaner to push the data fields down, but the current code already has the mapOutputFieldSeparator and the reduceOutputFieldSeparator in PipeMapRed.

Shall I move those two existing fields as well?

I will add the test once we got consensus on this question.


Owen O'Malley added a comment - 07/May/08 11:12 PM
Yes, please move the other fields down also.

Hadoop QA added a comment - 09/May/08 09:19 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381784/3341-2.patch
against trunk revision 654897.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2440/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2440/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2440/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2440/console

This message is automatically generated.


Owen O'Malley added a comment - 09/May/08 09:50 PM
This looks good, except for the findbugs warning for the unused numOfMapOutputPartitionFields. This should have a test case also.

Zheng Shao added a comment - 09/May/08 10:09 PM
Test included.

Hadoop QA added a comment - 10/May/08 01:24 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381801/3341-3.patch
against trunk revision 654973.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 8 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2443/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2443/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2443/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2443/console

This message is automatically generated.


Zheng Shao added a comment - 12/May/08 06:36 PM
Owen, any chance you can take a look at this today? This is a pretty simple case and hopefully it wouldn't take more than 15 minutes.

Owen O'Malley added a comment - 19/May/08 10:27 PM
I don't see any test case for the new stuff here. You would need to create a junit test that sets these attributes and validates the output.

Zheng Shao added a comment - 19/May/08 10:35 PM
Forgot to do "svn add" for the new test file last time.

Hadoop QA added a comment - 20/May/08 01:16 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12382335/3341-4.patch
against trunk revision 658035.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 11 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2502/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2502/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2502/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2502/console

This message is automatically generated.


Zheng Shao added a comment - 20/May/08 01:37 AM
Sorry for the problem, Owen. Will you take a look again?

Owen O'Malley added a comment - 22/May/08 12:40 AM
I'm sorry Zheng, but I think it would be much better if you used Strings instead of characters for the separators. That would prevent problems like someone putting in "", which would currently get a runtime error, or "xy", which would just silently use "x". I apologize for not catching it sooner.

We should probably file a separate jira for doing the same to TextOutputFormat.


Zheng Shao added a comment - 26/Jun/08 07:51 PM
Allow multi-character separators in streaming map/reduce input/output.

Hadoop QA added a comment - 26/Jun/08 09:57 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12384785/3341-5.patch
against trunk revision 671563.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 11 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2752/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2752/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2752/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2752/console

This message is automatically generated.


Zheng Shao added a comment - 26/Jun/08 10:09 PM
The "Test results" showed all tests are passed - why "Hadoop QA" says it's failing "contrib tests"?

Tsz Wo (Nicholas), SZE added a comment - 26/Jun/08 10:21 PM
> The "Test results" showed all tests are passed - why "Hadoop QA" says it's failing "contrib tests"?

I think there is some wrong in Hudson. I saw this on quite a few Hudson reports.


Owen O'Malley added a comment - 30/Jun/08 04:24 PM
I just committed this. Thanks, Zheng!

Hudson added a comment - 01/Jul/08 12:56 PM