Hadoop Common
  1. Hadoop Common
  2. HADOOP-7096

Allow setting of end-of-record delimiter for TextInputFormat

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0, 0.23.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The patch for https://issues.apache.org/jira/browse/MAPREDUCE-2254 required minor changes to the LineReader class to allow extensions (see attached 2.patch). Description copied below:

      It will be useful to allow setting the end-of-record delimiter for TextInputFormat. The current implementation hardcodes '\n', '\r' or '\r\n' as the only possible record delimiters. This is a problem if users have embedded newlines in their data fields (which is pretty common). This is also a problem for other tools using this TextInputFormat (See for example: https://issues.apache.org/jira/browse/PIG-836 and https://issues.cloudera.org/browse/SQOOP-136).
      I have wrote a patch to address this issue. This patch allows users to specify any custom end-of-record delimiter using a new added configuration property. For backward compatibility, if this new configuration property is absent, then the same exact previous delimiters are used (i.e., '\n', '\r' or '\r\n').

      1. HADOOP-7096.patch
        0.8 kB
        Ahmed Radwan
      2. HADOOP-7096_r2.patch
        9 kB
        Ahmed Radwan
      3. HADOOP-7096_r3.patch
        12 kB
        Ahmed Radwan
      4. hadoop-7096_r4.patch
        6 kB
        Todd Lipcon
      5. hadoop-7096.branch-1.patch
        7 kB
        Suresh Srinivas

        Issue Links

          Activity

          Hide
          Matt Foley added a comment -

          Added 1.2.0 to fixversion, per Dec commit.

          Show
          Matt Foley added a comment - Added 1.2.0 to fixversion, per Dec commit.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          +1 the branch-1 patch looks good.

          Show
          Tsz Wo Nicholas Sze added a comment - +1 the branch-1 patch looks good.
          Hide
          Suresh Srinivas added a comment -

          Attaching a patch that cleanly applies to branch-1.

          Show
          Suresh Srinivas added a comment - Attaching a patch that cleanly applies to branch-1.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #601 (See https://hudson.apache.org/hudson/job/Hadoop-Common-trunk/601/)
          HADOOP-7096. Allow setting of end-of-record delimiter for TextInputFormat. Contributed by Ahmed Radwan.

          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk #601 (See https://hudson.apache.org/hudson/job/Hadoop-Common-trunk/601/ ) HADOOP-7096 . Allow setting of end-of-record delimiter for TextInputFormat. Contributed by Ahmed Radwan.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #497 (See https://hudson.apache.org/hudson/job/Hadoop-Common-trunk-Commit/497/)

          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #497 (See https://hudson.apache.org/hudson/job/Hadoop-Common-trunk-Commit/497/ )
          Hide
          Todd Lipcon added a comment -

          Committed to trunk, thanks Ahmed!

          Show
          Todd Lipcon added a comment - Committed to trunk, thanks Ahmed!
          Hide
          Todd Lipcon added a comment -

          +1 then. I plan on committing this in the next day or two unless there are any objections.

          Show
          Todd Lipcon added a comment - +1 then. I plan on committing this in the next day or two unless there are any objections.
          Hide
          Ahmed Radwan added a comment -

          I have revised it, it looks good for me. Thanks Todd.

          Show
          Ahmed Radwan added a comment - I have revised it, it looks good for me. Thanks Todd.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12470517/hadoop-7096_r4.patch
          against trunk revision 1066284.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/223//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/223//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/223//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12470517/hadoop-7096_r4.patch against trunk revision 1066284. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/223//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/223//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/223//console This message is automatically generated.
          Hide
          Todd Lipcon added a comment -

          Ahmed's patch accidentally reformatted other parts of the file. Here's the same code but formatted according to our guidelines.

          I also changed the new methods to private instead of public and modified the javadoc a little bit. Can you check that these changes look good to you, Ahmed?

          Show
          Todd Lipcon added a comment - Ahmed's patch accidentally reformatted other parts of the file. Here's the same code but formatted according to our guidelines. I also changed the new methods to private instead of public and modified the javadoc a little bit. Can you check that these changes look good to you, Ahmed?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12470509/HADOOP-7096_r3.patch
          against trunk revision 1066284.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/221//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/221//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/221//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12470509/HADOOP-7096_r3.patch against trunk revision 1066284. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/221//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/221//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/221//console This message is automatically generated.
          Hide
          Ahmed Radwan added a comment -

          Thanks Todd for your comments. I have addressed them, please review the attached revised patch. For the test case, I previously added a LineRecordReader testcase that covers the cases when using user-specified and default delimiters.

          Show
          Ahmed Radwan added a comment - Thanks Todd for your comments. I have addressed them, please review the attached revised patch. For the test case, I previously added a LineRecordReader testcase that covers the cases when using user-specified and default delimiters.
          Hide
          Todd Lipcon added a comment -

          A few style nits:

          • after recordDelimiterBytes, need a blank line before the javadoc starts
          • the comment on this line:
                } else { //recordDelimiterBytes != null, use default delimite
            

            seems like it's backwards - don't you mean "== null" in the else clause? Also typo "delimite"

          • Worth considering splitting readLine into two methods, one for the true case, one for the false case
          • Can you make recordDelimiterBytes final?
          • In the MAX_VALUE case, for custom delimiter, the IOE should say "Too many bytes before delimiter", rather than "before newline"

          Although there aren't currently any unit tests to test LineReader in Common, it would be really great if you had time to add a couple. Otherwise we can probably consider this as being tested by the new unit inputformat tests in MAPREDUCE-2254.

          Show
          Todd Lipcon added a comment - A few style nits: after recordDelimiterBytes, need a blank line before the javadoc starts the comment on this line: } else { //recordDelimiterBytes != null , use default delimite seems like it's backwards - don't you mean "== null" in the else clause? Also typo "delimite" Worth considering splitting readLine into two methods, one for the true case, one for the false case Can you make recordDelimiterBytes final? In the MAX_VALUE case, for custom delimiter, the IOE should say "Too many bytes before delimiter", rather than "before newline" Although there aren't currently any unit tests to test LineReader in Common, it would be really great if you had time to add a couple. Otherwise we can probably consider this as being tested by the new unit inputformat tests in MAPREDUCE-2254 .
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12469722/HADOOP-7096_r2.patch
          against trunk revision 1064919.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/208//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/208//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/208//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12469722/HADOOP-7096_r2.patch against trunk revision 1064919. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/208//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/208//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/208//console This message is automatically generated.
          Hide
          Ahmed Radwan added a comment -

          The INFRA-3357 reviewboard issue I mentioned above was resolved (the three new git repositories were added). I Uploaded the diff files to the reviewboard for easier review.
          https://reviews.apache.org/r/329/
          https://reviews.apache.org/r/328/

          Show
          Ahmed Radwan added a comment - The INFRA-3357 reviewboard issue I mentioned above was resolved (the three new git repositories were added). I Uploaded the diff files to the reviewboard for easier review. https://reviews.apache.org/r/329/ https://reviews.apache.org/r/328/
          Hide
          Ahmed Radwan added a comment -

          The test case is included with the main patch (MAPREDUCE-2254). No test required here, since it is just changing few private fields to protected to allow extension in the main patch.

          Show
          Ahmed Radwan added a comment - The test case is included with the main patch ( MAPREDUCE-2254 ). No test required here, since it is just changing few private fields to protected to allow extension in the main patch.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12468214/HADOOP-7096.patch
          against trunk revision 1058343.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/179//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/179//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/179//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12468214/HADOOP-7096.patch against trunk revision 1058343. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/179//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/179//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/179//console This message is automatically generated.
          Hide
          Ahmed Radwan added a comment -

          The review board has a problem with uploading git diff files. Attached an updated file.

          Show
          Ahmed Radwan added a comment - The review board has a problem with uploading git diff files. Attached an updated file.
          Hide
          Ahmed Radwan added a comment -

          Used --no-prefix when generating the git diffs to solve the apply patch command problem.

          Show
          Ahmed Radwan added a comment - Used --no-prefix when generating the git diffs to solve the apply patch command problem.
          Hide
          Ahmed Radwan added a comment -

          Updated the patch to remove the extra prefixes (generated by git diff) causing the problem with patch command.

          Uploaded the patch to review board for easier review.
          https://reviews.apache.org/r/312/

          Show
          Ahmed Radwan added a comment - Updated the patch to remove the extra prefixes (generated by git diff) causing the problem with patch command. Uploaded the patch to review board for easier review. https://reviews.apache.org/r/312/
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12467949/2.patch
          against trunk revision 1057455.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/168//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12467949/2.patch against trunk revision 1057455. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/168//console This message is automatically generated.

            People

            • Assignee:
              Ahmed Radwan
              Reporter:
              Ahmed Radwan
            • Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development