Hadoop Map/Reduce / MAPREDUCE-2254

Allow setting of end-of-record delimiter for TextInputFormat

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      TextInputFormat may now split lines with delimiters other than newline, by specifying a configuration parameter "textinputformat.record.delimiter"

      Description

      It would be useful to allow setting the end-of-record delimiter for TextInputFormat. The current implementation hardcodes '\n', '\r' or '\r\n' as the only possible record delimiters. This is a problem when data fields contain embedded newlines, which is quite common. It is also a problem for other tools that use TextInputFormat (see, for example, https://issues.apache.org/jira/browse/PIG-836 and https://issues.cloudera.org/browse/SQOOP-136).

      I have written a patch to address this issue. It lets users specify a custom end-of-record delimiter through a newly added configuration property. For backward compatibility, if the property is absent, the previous delimiters are used unchanged (i.e., '\n', '\r' or '\r\n').
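The behaviour described above can be illustrated with a small, self-contained Java sketch. It is not the patch and not Hadoop's actual reader: the class `RecordSplitter` and its methods are hypothetical names, and the code works on an in-memory string rather than on input splits. It only shows the rule the description states: with no delimiter configured, a record ends at '\n', '\r' or '\r\n'; with a custom delimiter, a record ends only at that exact string, so embedded newlines stay inside the record.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, in-memory illustration of record splitting (hypothetical helper, not Hadoop code). */
final class RecordSplitter {

    /**
     * Splits text into records.
     * delimiter == null: default behaviour, a record ends at "\n", "\r" or "\r\n".
     * delimiter != null: a record ends only at that exact string.
     */
    static List<String> split(String text, String delimiter) {
        return delimiter == null ? splitDefault(text) : splitCustom(text, delimiter);
    }

    private static List<String> splitDefault(String text) {
        List<String> records = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\n' || c == '\r') {
                records.add(text.substring(start, i));
                // Treat "\r\n" as a single delimiter.
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                start = i + 1;
            }
            i++;
        }
        if (start < text.length()) {
            records.add(text.substring(start)); // trailing record with no delimiter after it
        }
        return records;
    }

    private static List<String> splitCustom(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int hit;
        while ((hit = text.indexOf(delimiter, start)) >= 0) {
            records.add(text.substring(start, hit));
            start = hit + delimiter.length();
        }
        if (start < text.length()) {
            records.add(text.substring(start)); // trailing record with no delimiter after it
        }
        return records;
    }

    public static void main(String[] args) {
        // The first record contains an embedded newline inside its "note" field.
        String data = "id=1,note=line one\nline two|id=2,note=single|";
        System.out.println(split(data, null).size() + " records with the default delimiters");
        System.out.println(split(data, "|").size() + " records with the custom delimiter \"|\"");
    }
}
```

In an actual job, the delimiter would presumably be supplied through the property named in the release note, for example `conf.set("textinputformat.record.delimiter", "|")` on the job's `Configuration`; the value "|" here is only an example.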

      1. MAPREDUCE-2245.patch
        10 kB
        Ahmed Radwan
      2. MAPREDUCE-2254_r2.patch
        7 kB
        Ahmed Radwan
      3. MAPREDUCE-2254_r3.patch
        8 kB
        Ahmed Radwan

        Issue Links

          Activity

          Ahmed Radwan created issue -
          Ahmed Radwan made changes -
          Field Original Value New Value
          Attachment 1.patch [ 12467947 ]
          Attachment 2.patch [ 12467948 ]
          Aaron T. Myers made changes -
          Link This issue incorporates HADOOP-7096 [ HADOOP-7096 ]
          Ahmed Radwan made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Ahmed Radwan made changes -
          Description Replaced "impeded newlines" with "embedded newlines"; no other change to the text.
          Ahmed Radwan made changes -
          Attachment MAPREDUCE-2245.patch [ 12468212 ]
          Attachment HADOOP-7096.patch [ 12468213 ]
          Ahmed Radwan made changes -
          Attachment 2.patch [ 12467948 ]
          Ahmed Radwan made changes -
          Attachment 1.patch [ 12467947 ]
          Ahmed Radwan made changes -
          Attachment HADOOP-7096.patch [ 12468213 ]
          Ahmed Radwan made changes -
          Attachment MAPREDUCE-2254_r2.patch [ 12469721 ]
          Ahmed Radwan made changes -
          Attachment MAPREDUCE-2254_r3.patch [ 12471118 ]
          Todd Lipcon made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Release Note TextInputFormat may now split lines with delimiters other than newline, by specifying a configuration parameter "textinputformat.record.delimiter"
          Assignee Ahmed Radwan [ ahmed.radwan ]
          Fix Version/s 0.23.0 [ 12315570 ]
          Resolution Fixed [ 1 ]
          Arun C Murthy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Ahmed Radwan
              Reporter:
              Ahmed Radwan
            • Votes:
              0
              Watchers:
              14

              Dates

              • Created:
                Updated:
                Resolved:
