Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-2254

Allow setting of end-of-record delimiter for TextInputFormat

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      TextInputFormat may now split lines with delimiters other than newline, by specifying a configuration parameter "textinputformat.record.delimiter"

      Description

      It will be useful to allow setting the end-of-record delimiter for TextInputFormat. The current implementation hardcodes '\n', '\r' or '\r\n' as the only possible record delimiters. This is a problem if users have embedded newlines in their data fields (which is pretty common). This is also a problem for other tools using this TextInputFormat (See for example: https://issues.apache.org/jira/browse/PIG-836 and https://issues.cloudera.org/browse/SQOOP-136).

      I have wrote a patch to address this issue. This patch allows users to specify any custom end-of-record delimiter using a new added configuration property. For backward compatibility, if this new configuration property is absent, then the same exact previous delimiters are used (i.e., '\n', '\r' or '\r\n').

      1. MAPREDUCE-2245.patch
        10 kB
        Ahmed Radwan
      2. MAPREDUCE-2254_r2.patch
        7 kB
        Ahmed Radwan
      3. MAPREDUCE-2254_r3.patch
        8 kB
        Ahmed Radwan

        Issue Links

          Activity

          Hide
          Michele Giusto added a comment -

          Hi everybody, I believe there is a bug when the custom record delimiter is longer than 1 character. For example using as delimiter "#$&" and having the line "...record1#$&record2#$&record3#$&..." divided between 2 consecutive input splits, with the second input split beginning after the first "$" (so it starts with "&record2#$&record3#$&..."), "record2" will not not be read.
          This is due to the fact that the mapper that processes the second split starts reading from the last character of the first input split (so the "$"), then it looses the delimiter between "record1" and "record2". In this way the constructor of the mapper tries to skip the last line of the previous input split but it instead skips the first line of its one and reports "record3" as the first line.
          If you agree that this is a bug, a possible solution may be to modify the LineRecordReader class to start reading each input split (except the first one) not from the last character of the previous input split but going back a number of characters equals to the number of characters of the record delimiter (3 in my example).

          Show
          Michele Giusto added a comment - Hi everybody, I believe there is a bug when the custom record delimiter is longer than 1 character. For example using as delimiter "#$&" and having the line "...record1#$&record2#$&record3#$&..." divided between 2 consecutive input splits, with the second input split beginning after the first "$" (so it starts with "&record2#$&record3#$&..."), "record2" will not not be read. This is due to the fact that the mapper that processes the second split starts reading from the last character of the first input split (so the "$"), then it looses the delimiter between "record1" and "record2". In this way the constructor of the mapper tries to skip the last line of the previous input split but it instead skips the first line of its one and reports "record3" as the first line. If you agree that this is a bug, a possible solution may be to modify the LineRecordReader class to start reading each input split (except the first one) not from the last character of the previous input split but going back a number of characters equals to the number of characters of the record delimiter (3 in my example).
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #643 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk/643/)

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #643 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk/643/ )
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #613 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/613/)
          MAPREDUCE-2254. Allow setting of end-of-record delimiter for TextInputFormat. Contributed by Ahmed Radwan.

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #613 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/613/ ) MAPREDUCE-2254 . Allow setting of end-of-record delimiter for TextInputFormat. Contributed by Ahmed Radwan.
          Hide
          Todd Lipcon added a comment -

          Committed to trunk, thanks Ahmed!

          Show
          Todd Lipcon added a comment - Committed to trunk, thanks Ahmed!
          Hide
          Todd Lipcon added a comment -

          All unit tests passed. Will commit momentarily.

          Show
          Todd Lipcon added a comment - All unit tests passed. Will commit momentarily.
          Hide
          Todd Lipcon added a comment -

          test-patch results:
          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 2 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          [exec]
          [exec] +1 system test framework. The patch passed system test framework compile.

          kicking off unit tests now.

          Show
          Todd Lipcon added a comment - test-patch results: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 2 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] [exec] +1 system test framework. The patch passed system test framework compile. kicking off unit tests now.
          Hide
          Ahmed Radwan added a comment -

          Adding the Apache license header to the test file.

          Show
          Ahmed Radwan added a comment - Adding the Apache license header to the test file.
          Hide
          Todd Lipcon added a comment -

          looks good except one minor nit - can you add the apache license to the new test file?

          Show
          Todd Lipcon added a comment - looks good except one minor nit - can you add the apache license to the new test file?
          Hide
          Ahmed Radwan added a comment -

          Thanks Todd for your comments. I have moved the changes to the LineReader, with new constructors for the user-specified line delimiter. If old constructors are used, then the exact old behavior is preserved. I have attached the new revised patches.

          Show
          Ahmed Radwan added a comment - Thanks Todd for your comments. I have moved the changes to the LineReader, with new constructors for the user-specified line delimiter. If old constructors are used, then the exact old behavior is preserved. I have attached the new revised patches.
          Hide
          Todd Lipcon added a comment -

          It seems you're being inconsistent here - why is it that LineReader shouldn't take an arbitrary delimiter but LineRecordReader should? What I mean here is that either the concept of a "line" is a sequence of characters with a newline, or it's a sequence of characters with an arbitrary delimiter. If "line" means something with a newline, then maybe this new feature should go in a new class like DelimitedTextInputFormat or something? If "line" really could be delimited by anything, then I would support moving this support up to LineReader, with a different constructor. That way at least the similar code will be next to each other.

          It just smells really bad to me to extend a class and then reimplement its only nontrivial method. Maybe we could alternatively extract an interface here?

          Show
          Todd Lipcon added a comment - It seems you're being inconsistent here - why is it that LineReader shouldn't take an arbitrary delimiter but LineRecordReader should? What I mean here is that either the concept of a "line" is a sequence of characters with a newline, or it's a sequence of characters with an arbitrary delimiter. If "line" means something with a newline, then maybe this new feature should go in a new class like DelimitedTextInputFormat or something? If "line" really could be delimited by anything, then I would support moving this support up to LineReader, with a different constructor. That way at least the similar code will be next to each other. It just smells really bad to me to extend a class and then reimplement its only nontrivial method. Maybe we could alternatively extract an interface here?
          Hide
          Ahmed Radwan added a comment -

          The INFRA-3357 issue I mentioned above was resolved (the three new git repositories were added). I Uploaded the diff files to the reviewboard for easier review.
          https://reviews.apache.org/r/329/
          https://reviews.apache.org/r/328/

          Show
          Ahmed Radwan added a comment - The INFRA-3357 issue I mentioned above was resolved (the three new git repositories were added). I Uploaded the diff files to the reviewboard for easier review. https://reviews.apache.org/r/329/ https://reviews.apache.org/r/328/
          Hide
          Ahmed Radwan added a comment -

          Hi Todd. I agree that the changes can directly go to the LineReader. My motive was keeping the LineReader mostly unchanged, in case it is used in other contexts. The LineReader breaks the input stream using new lines, which is totally fine and it exactly does what its name suggests. This is why I thought of encapsulating the changes within the RecordReader (where conceptually these changes are required). However, I see your point that it looks a little weird. I can move the changes to LineReader but then its name will not convey its functionality, and if we rename it, this can cause other problems. What do you think?

          Show
          Ahmed Radwan added a comment - Hi Todd. I agree that the changes can directly go to the LineReader. My motive was keeping the LineReader mostly unchanged, in case it is used in other contexts. The LineReader breaks the input stream using new lines, which is totally fine and it exactly does what its name suggests. This is why I thought of encapsulating the changes within the RecordReader (where conceptually these changes are required). However, I see your point that it looks a little weird. I can move the changes to LineReader but then its name will not convey its functionality, and if we rename it, this can cause other problems. What do you think?
          Hide
          Todd Lipcon added a comment -

          Hi Ahmed. It looks like GenericLineReader essentially reimplements most of LineReader by overriding the readLine method. It seems strange to me to extend a class only to reimplement the majority of it. Could this patch be done by just modifying LineReader itself to be more generic, perhaps keeping the CRLF implementation as a private method for "fast path"?

          Show
          Todd Lipcon added a comment - Hi Ahmed. It looks like GenericLineReader essentially reimplements most of LineReader by overriding the readLine method. It seems strange to me to extend a class only to reimplement the majority of it. Could this patch be done by just modifying LineReader itself to be more generic, perhaps keeping the CRLF implementation as a private method for "fast path"?
          Hide
          Ahmed Radwan added a comment -

          See INFRA-3357 for the problem of uploading git diff files to the review board.

          Show
          Ahmed Radwan added a comment - See INFRA-3357 for the problem of uploading git diff files to the review board.
          Hide
          Ahmed Radwan added a comment -

          I have attached the patches. The review board has a problem with uploading git diff files.

          Show
          Ahmed Radwan added a comment - I have attached the patches. The review board has a problem with uploading git diff files.
          Hide
          Ahmed Radwan added a comment -

          The updated patches.
          Added a new test case, and used --no-prefix when generating the git diff files.

          Show
          Ahmed Radwan added a comment - The updated patches. Added a new test case, and used --no-prefix when generating the git diff files.
          Hide
          Ahmed Radwan added a comment -

          Uploaded the patch to review board for easier review.
          https://reviews.apache.org/r/293/
          https://reviews.apache.org/r/312/

          I have also added a new test case to the patch.

          Show
          Ahmed Radwan added a comment - Uploaded the patch to review board for easier review. https://reviews.apache.org/r/293/ https://reviews.apache.org/r/312/ I have also added a new test case to the patch.
          Hide
          Ahmed Radwan added a comment -

          1.patch applies on hadoop-mapreduce
          2.patch applies on hadoop-common

          Show
          Ahmed Radwan added a comment - 1.patch applies on hadoop-mapreduce 2.patch applies on hadoop-common

            People

            • Assignee:
              Ahmed Radwan
              Reporter:
              Ahmed Radwan
            • Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development