Hadoop Common / HADOOP-8655

TextInputFormat: when textinputformat.record.delimiter is set, characters in the data file that match the beginning of the delimiter are sometimes missing from the map output


Details

    • hadoop, mapreduce

    Description

      Set textinputformat.record.delimiter to "</entity>".

      Suppose the input is a text file with the following content:
      <entity><id>1</id><name>User1</name></entity><entity><id>2</id><name>User2</name></entity><entity><id>3</id><name>User3</name></entity><entity><id>4</id><name>User4</name></entity><entity><id>5</id><name>User5</name></entity>

      The mapper is expected to receive the following values:

      Value 1 - <entity><id>1</id><name>User1</name>
      Value 2 - <entity><id>2</id><name>User2</name>
      Value 3 - <entity><id>3</id><name>User3</name>
      Value 4 - <entity><id>4</id><name>User4</name>
      Value 5 - <entity><id>5</id><name>User5</name>
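The expected values above can be reproduced outside Hadoop with a plain in-memory split. The following is a minimal reference sketch, not Hadoop code: the class name DelimiterSplitSketch and the method splitRecords are hypothetical, and the whole input is assumed to fit in a single String.

```java
import java.util.ArrayList;
import java.util.List;

public class DelimiterSplitSketch {
    /** Splits text on every occurrence of delimiter; the delimiter itself is dropped. */
    public static List<String> splitRecords(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int hit;
        while ((hit = text.indexOf(delimiter, start)) >= 0) {
            records.add(text.substring(start, hit));
            start = hit + delimiter.length();
        }
        if (start < text.length()) {
            // Trailing data that is not closed by a delimiter forms a last record.
            records.add(text.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        // Rebuild the five-entity sample input from the description.
        StringBuilder input = new StringBuilder();
        for (int i = 1; i <= 5; i++) {
            input.append("<entity><id>").append(i)
                 .append("</id><name>User").append(i)
                 .append("</name></entity>");
        }
        List<String> records = splitRecords(input.toString(), "</entity>");
        for (int i = 0; i < records.size(); i++) {
            System.out.println("Value " + (i + 1) + " - " + records.get(i));
        }
    }
}
```

Comparing the mapper's actual values against this reference makes dropped characters easy to spot.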

      With this bug, the mapper instead receives:

      Value 1 - entity><id>1</id><name>User1</name>
      Value 2 - <entity>id>2</id><name>User2</name>
      Value 3 - <entity><id>3id><name>User3</name>
      Value 4 - <entity><id>4</id><name>User4name>
      Value 5 - <entity><id>5</id><name>User5</name>

      The pattern shown above is only illustrative and need not occur in values 1, 2, and 3 specifically. The bug drops characters at seemingly random positions in the map input.
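This report does not state the root cause. One mechanism that would produce this symptom, offered here purely as an assumption, is a reader that scans the input in fixed-size chunks and forgets a partial delimiter match left at the end of one chunk when the next chunk fails to complete it. The sketch below simulates that idea in plain Java. It is not Hadoop's code: the names ChunkedDelimiterSketch, readRecords, chunkSize, and restoreCarried are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedDelimiterSketch {
    /**
     * Hypothetical chunked record reader. Assumes the first character of delim
     * does not recur inside delim (true for "</entity>"), so a failed partial
     * match never overlaps the start of another match.
     */
    public static List<String> readRecords(String text, String delim,
                                           int chunkSize, boolean restoreCarried) {
        if (chunkSize <= 0 || delim.isEmpty()) {
            throw new IllegalArgumentException("chunkSize must be > 0 and delim non-empty");
        }
        List<String> records = new ArrayList<>();
        StringBuilder record = new StringBuilder();
        int matched = 0; // delimiter characters matched so far
        int carried = 0; // how many of them were matched in an earlier chunk
        for (int start = 0; start < text.length(); start += chunkSize) {
            int end = Math.min(start + chunkSize, text.length());
            carried = matched; // a pending partial match now lives in an old chunk
            for (int i = start; i < end; i++) {
                char c = text.charAt(i);
                if (c == delim.charAt(matched)) {
                    matched++;
                    if (matched == delim.length()) { // full delimiter: close the record
                        records.add(record.toString());
                        record.setLength(0);
                        matched = 0;
                        carried = 0;
                    }
                    continue;
                }
                if (matched > 0) {
                    // The partial match was ordinary data. Characters matched in the
                    // current chunk are always copied back; carried ones only when
                    // restoreCarried is true.
                    record.append(delim, restoreCarried ? 0 : carried, matched);
                    matched = 0;
                    carried = 0;
                    if (c == delim.charAt(0)) { // c may itself start a new match
                        matched = 1;
                        continue;
                    }
                }
                record.append(c);
            }
        }
        if (matched > 0) { // input ended inside a partial match
            record.append(delim, restoreCarried ? 0 : carried, matched);
        }
        if (record.length() > 0) {
            records.add(record.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "<id>1</id></entity>";
        System.out.println(readRecords(text, "</entity>", 6, false));
        System.out.println(readRecords(text, "</entity>", 6, true));
    }
}
```

With a chunk size of 6, the first chunk of `<id>1</id></entity>` ends on the `<` of `</id>`. The lossy mode returns `<id>1/id>`, while restoring the carried prefix returns `<id>1</id>`. Under this assumption, losses depend on where chunk boundaries fall relative to the data, which would be consistent with the seemingly random positions described above.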

      Attachments

        1. MAPREDUCE-4519.patch
          4 kB
          Meria Joseph
        2. HADOOP-8655 (2).patch
          11 kB
          Gelesh
        3. HADOOP-8655.patch
          10 kB
          Gelesh
        4. HADOOP-8655.patch
          10 kB
          Gelesh
        5. HADOOP-8655.patch
          11 kB
          Gelesh


          People

            Assignee: Unassigned
            Reporter: Arun A K (ak.arun@aol.com)
            Votes: 1
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Original Estimate: 168h
                Remaining Estimate: 168h
                Time Spent: Not Specified