Hadoop Common / HADOOP-7096

Allow setting of end-of-record delimiter for TextInputFormat

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0, 0.23.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The patch for https://issues.apache.org/jira/browse/MAPREDUCE-2254 required minor changes to the LineReader class to allow extensions (see attached 2.patch). Description copied below:

      It will be useful to allow setting the end-of-record delimiter for TextInputFormat. The current implementation hardcodes '\n', '\r' or '\r\n' as the only possible record delimiters. This is a problem if users have embedded newlines in their data fields (which is pretty common). This is also a problem for other tools using this TextInputFormat (See for example: https://issues.apache.org/jira/browse/PIG-836 and https://issues.cloudera.org/browse/SQOOP-136).
      I have written a patch to address this issue. The patch lets users specify any custom end-of-record delimiter through a newly added configuration property. For backward compatibility, if the property is absent, the previous delimiters are used unchanged (i.e., '\n', '\r' or '\r\n').
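
      The behaviour described above can be sketched in plain Java. This is a minimal illustration and not the attached patch or Hadoop's LineReader: the class DelimitedRecordSplitter and its methods are hypothetical, java.util.Properties stands in for the job configuration, and the property key is an assumption, since the description does not name it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Illustrative sketch only; not Hadoop's LineReader. */
final class DelimitedRecordSplitter {

    // Assumed key name: the issue text does not state the property.
    public static final String DELIMITER_KEY = "textinputformat.record.delimiter";

    /** Looks up the delimiter; an absent property selects the default newline handling. */
    public static List<String> readRecords(byte[] data, Properties conf) {
        String delimiter = conf.getProperty(DELIMITER_KEY); // null when unset
        byte[] delimiterBytes =
                (delimiter == null) ? null : delimiter.getBytes(StandardCharsets.UTF_8);
        return split(data, delimiterBytes);
    }

    public static List<String> split(byte[] data, byte[] delimiter) {
        if (delimiter == null || delimiter.length == 0) {
            return splitOnNewlines(data);
        }
        return splitOnCustom(data, delimiter);
    }

    /** Default path: '\n', '\r' and '\r\n' each end a record. */
    private static List<String> splitOnNewlines(byte[] data) {
        List<String> records = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i < data.length) {
            if (data[i] == '\n' || data[i] == '\r') {
                records.add(new String(data, start, i - start, StandardCharsets.UTF_8));
                // Consume "\r\n" as a single delimiter.
                if (data[i] == '\r' && i + 1 < data.length && data[i + 1] == '\n') {
                    i++;
                }
                i++;
                start = i;
            } else {
                i++;
            }
        }
        if (start < data.length) {
            records.add(new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return records;
    }

    /** Custom path: only the configured byte sequence ends a record. */
    private static List<String> splitOnCustom(byte[] data, byte[] delimiter) {
        List<String> records = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i <= data.length - delimiter.length) {
            if (matchesAt(data, i, delimiter)) {
                records.add(new String(data, start, i - start, StandardCharsets.UTF_8));
                i += delimiter.length;
                start = i;
            } else {
                i++;
            }
        }
        if (start < data.length) {
            records.add(new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return records;
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] delimiter) {
        for (int j = 0; j < delimiter.length; j++) {
            if (data[offset + j] != delimiter[j]) {
                return false;
            }
        }
        return true;
    }
}
```

      With the property unset, input such as "a\nb\r\nc\rd" splits on every newline variant. With the property set to a sequence like "||", newlines inside a field stay part of the record, which is the case the description is concerned with. The sketch works on an in-memory byte array for brevity; a real reader would also have to handle buffering and records that straddle input split boundaries.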

      1. HADOOP-7096.patch
        0.8 kB
        Ahmed Radwan
      2. hadoop-7096.branch-1.patch
        7 kB
        Suresh Srinivas
      3. hadoop-7096_r4.patch
        6 kB
        Todd Lipcon
      4. HADOOP-7096_r3.patch
        12 kB
        Ahmed Radwan
      5. HADOOP-7096_r2.patch
        9 kB
        Ahmed Radwan

          Activity

          Ahmed Radwan created issue -
          Ahmed Radwan made changes -
          Field Original Value New Value
          Attachment 2.patch [ 12467949 ]
          Aaron T. Myers made changes -
          Link This issue is part of MAPREDUCE-2254 [ MAPREDUCE-2254 ]
          Ahmed Radwan made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Ahmed Radwan made changes -
          Description "[...] This is a problem if users have impeded newlines in their data fields [...]" "[...] This is a problem if users have embedded newlines in their data fields [...]" (only "impeded" changed to "embedded"; the rest of the Description text is identical in both values)

          Ahmed Radwan made changes -
          Attachment HADOOP-7096.patch [ 12468214 ]
          Ahmed Radwan made changes -
          Attachment 2.patch [ 12467949 ]
          Ahmed Radwan made changes -
          Attachment HADOOP-7096_r2.patch [ 12469722 ]
          Ahmed Radwan made changes -
          Attachment HADOOP-7096_r3.patch [ 12470509 ]
          Todd Lipcon made changes -
          Attachment hadoop-7096_r4.patch [ 12470517 ]
          Todd Lipcon made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Assignee Ahmed Radwan [ ahmed.radwan ]
          Fix Version/s 0.23.0 [ 12315569 ]
          Resolution Fixed [ 1 ]
          Arun C Murthy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Suresh Srinivas made changes -
          Attachment hadoop-7096.branch-1.patch [ 12560507 ]
          Matt Foley made changes -
          Fix Version/s 1.2.0 [ 12321659 ]

            People

            • Assignee:
              Ahmed Radwan
              Reporter:
              Ahmed Radwan
            • Votes:
              0
              Watchers:
              9
