Hadoop Common / HADOOP-3295

Allow TextOutputFormat to use configurable separators

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: io
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      TextOutputFormat uses a hardcoded tab as the key-value separator. We should allow configurable separators such as ^A.
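
As a rough illustration of the requested behavior, the sketch below looks up a separator from a configuration and falls back to tab when none is set. It does not use the Hadoop API: the Map stands in for a job configuration, and the class and method names are hypothetical. Only the property name mapred.textoutputformat.separator comes from the discussion on this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a job configuration lookup; not the Hadoop API.
class SeparatorConfigSketch {
    // Property name as quoted in the comments on this issue.
    static final String SEPARATOR_KEY = "mapred.textoutputformat.separator";
    // The previously hardcoded separator, kept as the default.
    static final String DEFAULT_SEPARATOR = "\t";

    // Returns the configured separator, or tab when the property is unset.
    static String separatorFrom(Map<String, String> conf) {
        return conf.getOrDefault(SEPARATOR_KEY, DEFAULT_SEPARATOR);
    }

    // Formats one output record as key + separator + value.
    static String formatRecord(String key, String value, Map<String, String> conf) {
        return key + separatorFrom(conf) + value;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(formatRecord("k", "v", conf)); // tab-separated by default
        conf.put(SEPARATOR_KEY, ",");
        System.out.println(formatRecord("k", "v", conf)); // prints "k,v"
    }
}
```

Keeping tab as the default means jobs that never set the property see unchanged output.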

      1. 3295-2.patch
        7 kB
        Zheng Shao
      2. 3295.patch
        3 kB
        Zheng Shao

          Activity

          Zheng Shao added a comment -

          Can you open a separate jira and mark this one as related? Then we can discuss from there and produce a fix.

          Suhas Gogate added a comment -

          The feature added by this Jira fails when the separator is a character that is invalid in XML, e.g. ctrl-A (mapred.textoutputformat.separator = "\u0001").

          For example:
          String delim = "\u0001";
          conf.set("mapred.textoutputformat.separator", delim);

          The job client serializes the jobconf with mapred.textoutputformat.separator set to "\u0001" (ctrl-A). The problem occurs when the job tracker deserializes (reads back) the jobconf and encounters the invalid XML character.

          The test for this feature, testFormatWithCustomSeparator(), does not serialize the jobconf after setting the separator to ctrl-A, and hence does not detect this problem.

          Here is an exception:

          08/12/06 01:40:50 INFO mapred.FileInputFormat: Total input paths to process : 1
          org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.RuntimeException: org.xml.sax.SAXParseException: Character reference "&#1" is an invalid XML character.
          at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:961)
          at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:864)
          at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:832)
          at org.apache.hadoop.conf.Configuration.get(Configuration.java:291)
          at org.apache.hadoop.mapred.JobConf.getJobPriority(JobConf.java:1163)
          at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:179)
          at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1783)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

          at org.apache.hadoop.ipc.Client.call(Client.java:715)
          at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
          at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
          at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
          at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026)
          at
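
The failure can be reproduced without Hadoop. The sketch below is an illustration only: it hand-builds a small configuration-style XML document whose value is a numeric character reference and feeds it to the standard JDK parser. The class and method names are hypothetical, and the document is assembled by string concatenation rather than by Hadoop's own serializer. A tab (&#9;) parses, while ctrl-A (&#1;) is rejected because XML 1.0 does not allow that character even as a reference.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

// Hypothetical reproduction of the round-trip problem; not Hadoop code.
class InvalidXmlSeparatorSketch {
    // Builds a one-property document with the value written as a numeric
    // character reference, matching the "&#1" named in the exception.
    static String configXml(String name, int codePoint) {
        return "<?xml version=\"1.0\"?><configuration><property><name>" + name
            + "</name><value>&#" + codePoint + ";</value></property></configuration>";
    }

    // Returns true when the JDK XML parser accepts the document,
    // false when it rejects it with a SAXParseException.
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXParseException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String key = "mapred.textoutputformat.separator";
        System.out.println("tab parses:    " + parses(configXml(key, 9)));
        System.out.println("ctrl-A parses: " + parses(configXml(key, 1)));
    }
}
```

A test that writes the configuration out and reads it back, as this sketch does, would have caught the problem that the in-memory test missed.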

          Hudson added a comment -

          Integrated in Hadoop-trunk #471 (see http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/471/)
          Chris Douglas added a comment -

          I just committed this. Thanks, Zheng

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12380797/3295-2.patch
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2312/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2312/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2312/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2312/console

          This message is automatically generated.

          Zheng Shao added a comment -

          Added a test for customized separator.

          Added a constructor with the old prototype to make sure user code does not break because of the patch.

          Runping Qi added a comment -

          Note that you have made public api changes:

          public LineRecordWriter(DataOutputStream out)
          

          into

          public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
          

          It is better to keep the original one as an overloaded constructor:

          public LineRecordWriter(DataOutputStream out) {
              this(out, "\t");
          }
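
A fuller sketch of this suggestion follows, under stated assumptions: LineWriterSketch is a simplified, hypothetical stand-in for LineRecordWriter that takes String keys and values and encodes them as UTF-8, which is not necessarily what the real class does. It shows the old one-argument constructor delegating to the new two-argument one, so existing callers keep tab-separated output.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Simplified, hypothetical writer; not the actual LineRecordWriter.
class LineWriterSketch {
    private final DataOutputStream out;
    private final byte[] separator;

    // New constructor: the caller chooses the key-value separator.
    LineWriterSketch(DataOutputStream out, String keyValueSeparator) {
        this.out = out;
        this.separator = keyValueSeparator.getBytes(StandardCharsets.UTF_8);
    }

    // Old signature kept for compatibility; delegates with the former hardcoded tab.
    LineWriterSketch(DataOutputStream out) {
        this(out, "\t");
    }

    // Writes key, separator, value, and a newline.
    void write(String key, String value) throws IOException {
        out.write(key.getBytes(StandardCharsets.UTF_8));
        out.write(separator);
        out.write(value.getBytes(StandardCharsets.UTF_8));
        out.write('\n');
    }

    // Demo helper: writes one record and returns it as a string.
    // A null separator selects the old one-argument constructor.
    static String render(String separatorOrNull) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            LineWriterSketch writer = (separatorOrNull == null)
                ? new LineWriterSketch(out)
                : new LineWriterSketch(out, separatorOrNull);
            writer.write("k", "v");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(render(null)); // tab-separated record
        System.out.print(render("|"));  // prints "k|v"
    }
}
```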
          
          Owen O'Malley added a comment -

          Zheng, please include a test for the new functionality.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12380659/3295.patch
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included -1. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2299/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2299/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2299/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2299/console

          This message is automatically generated.

          Milind Bhandarkar added a comment -

          This is great!!!

          I have been requesting this for a long time!!!!

          Thanks, Zheng!

          Committers, please please please take a serious look at this!

          Zheng Shao added a comment -

          This patch adds the configuration parameter.


            People

            • Assignee: Zheng Shao
            • Reporter: Zheng Shao
            • Votes: 0
            • Watchers: 1
