Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1277

Streaming job should support other characterset in user's stderr log, not only utf8

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.21.0
    • Fix Version/s: None
    • Component/s: contrib/streaming
    • Labels:
      None
    • Tags:
      streaming utf8

      Description

      Current implementation in streaming only support utf8 encoded user stderr log, it should encode free to support other characterset.

      1. streaming-1277.patch
        0.6 kB
        ZhuGuanyin
      2. streaming-1277-new.patch
        0.7 kB
        ZhuGuanyin

        Activity

        Hide
        ZhuGuanyin added a comment -

        test case:
        using the following mapper, and you would see the stderr log has corrupted.

        #!/bin/sh
        cat >/dev/null

        echo "㊣ ?※" >&2
        echo "礙骯襖壩闆辦" >&2

        Show
        ZhuGuanyin added a comment - test case: using the following mapper, and you would see the stderr log has corrupted. #!/bin/sh cat >/dev/null echo "㊣ ?※" >&2 echo "礙骯襖壩闆辦" >&2
        Hide
        ZhuGuanyin added a comment -

        a simple solution:

        change line 492 in PipeMapRed.java

        System.err.println(lineStr);

        to:
        System.err.write(line.getBytes(),0,line.getLength());
        System.err.println();

        I will attach the patch soon.

        Show
        ZhuGuanyin added a comment - a simple solution: change line 492 in PipeMapRed.java System.err.println(lineStr); to: System.err.write(line.getBytes(),0,line.getLength()); System.err.println(); I will attach the patch soon.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12427460/streaming-1277.patch
        against trunk revision 888761.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/309/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427460/streaming-1277.patch against trunk revision 888761. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/309/console This message is automatically generated.
        Hide
        ZhuGuanyin added a comment -

        this patch change

        System.err.println(lineStr);

        to

        System.err.write(line.getBytes(),0,line.getLength());
        System.err.println();

        I think it could be verified by review, and it not very easy to write a testcase for this jira.

        manual steps to check this :

        1)copy a small file to hdfs

        2)run streaming job using the mapper as follows:

        #!/bin/sh
        cat >/dev/null

        echo "㊣ ?※" >&2
        echo "礙骯襖壩闆辦" >&2

        3) check the task stderr output, the logs would corrupted.

        4) add the patch, and run the streaming job again, the task stderr would be fine.

        this patch is usefull when user need write some debug message, example: some input record which might be encoded by big5, GBK and so on.

        Show
        ZhuGuanyin added a comment - this patch change System.err.println(lineStr); to System.err.write(line.getBytes(),0,line.getLength()); System.err.println(); I think it could be verified by review, and it not very easy to write a testcase for this jira. manual steps to check this : 1)copy a small file to hdfs 2)run streaming job using the mapper as follows: #!/bin/sh cat >/dev/null echo "㊣ ?※" >&2 echo "礙骯襖壩闆辦" >&2 3) check the task stderr output, the logs would corrupted. 4) add the patch, and run the streaming job again, the task stderr would be fine. this patch is usefull when user need write some debug message, example: some input record which might be encoded by big5, GBK and so on.
        Hide
        Zheng Shao added a comment -

        I think a better way to do this is to add an "encoding" property to JobConf so that we can encode and decode the data correctly.
        That also allows us to do codec changes if needed.

        Does that make sense?

        Show
        Zheng Shao added a comment - I think a better way to do this is to add an "encoding" property to JobConf so that we can encode and decode the data correctly. That also allows us to do codec changes if needed. Does that make sense?
        Hide
        ZhuGuanyin added a comment -

        I think the framework should not care what the characterset of the input and user log, may be the input or output has more than one characterset.

        what hadoop need to do is read raw data for user mapper or reducer, collect raw stdout and stderr data and save them on hdfs or tasktracker local disk.

        raw in, raw out, no matter what characterset it is.

        Show
        ZhuGuanyin added a comment - I think the framework should not care what the characterset of the input and user log, may be the input or output has more than one characterset. what hadoop need to do is read raw data for user mapper or reducer, collect raw stdout and stderr data and save them on hdfs or tasktracker local disk. raw in, raw out, no matter what characterset it is.
        Hide
        Zheng Shao added a comment -

        Hadoop does need to understand the data format in stdout to split the records and key/value inside the record.
        By default, Hadoop streaming uses utf-8, "\n" and "\t".

        For stderr, Hadoop needs to know the line boundary "\n" as well. Hadoop already supports reporting (change of counters etc) through stderr.

        As a result, I think it's a better idea to specify the encoding of the streams.

        Show
        Zheng Shao added a comment - Hadoop does need to understand the data format in stdout to split the records and key/value inside the record. By default, Hadoop streaming uses utf-8, "\n" and "\t". For stderr, Hadoop needs to know the line boundary "\n" as well. Hadoop already supports reporting (change of counters etc) through stderr. As a result, I think it's a better idea to specify the encoding of the streams.
        Hide
        ZhuGuanyin added a comment -

        regenerate the patch using svn diff at root dir, thanks.

        Show
        ZhuGuanyin added a comment - regenerate the patch using svn diff at root dir, thanks.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12428009/streaming-1277-new.patch
        against trunk revision 891524.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/210/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/210/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/210/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/210/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12428009/streaming-1277-new.patch against trunk revision 891524. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/210/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/210/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/210/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/210/console This message is automatically generated.
        Hide
        Chris Douglas added a comment -

        Running through Hudson, again

        Show
        Chris Douglas added a comment - Running through Hudson, again
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12428009/streaming-1277-new.patch
        against trunk revision 902272.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/284/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/284/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/284/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/284/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12428009/streaming-1277-new.patch against trunk revision 902272. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/284/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/284/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/284/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/284/console This message is automatically generated.
        Hide
        Chris Douglas added a comment -

        I agree with Zheng; setting the encoding explicitly would be preferred to pushing bytes.

        Show
        Chris Douglas added a comment - I agree with Zheng; setting the encoding explicitly would be preferred to pushing bytes.

          People

          • Assignee:
            ZhuGuanyin
            Reporter:
            ZhuGuanyin
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development