Hadoop Map/Reduce / MAPREDUCE-6549

multibyte delimiters with LineRecordReader cause duplicate records

    Details

    • Hadoop Flags:
      Reviewed

      Description

      LineRecordReader currently produces duplicate records when a multibyte delimiter is used, in scenarios such as the following (the first case is a passing baseline):

      1) input string: "abc++defghi+"
      delimiter string: "+++"
      test passes with all split sizes
      2) input string: "abc+def+ghi+"
      delimiter string: "+++"
      test fails with a split size of 4
      3) input string: "abc++defghi+"
      delimiter string: "++"
      test fails with a split size of 5
      4) input string: "abc++defghij+"
      delimiter string: "++"
      test fails with a split size of 4
      5) input string: "abc+def+ghi+"
      delimiter string: "++"
      test fails with a split size of 9
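
The scenarios above can be checked against a single-pass baseline: whatever the split size, the records read across all splits should match the records obtained by scanning the whole input once. The sketch below computes that baseline in plain Java. It is not Hadoop code; the class name `DelimiterBaseline` and the method `expectedRecords` are hypothetical names chosen for illustration, and the trailing-remainder behaviour (a final record is emitted even without a closing delimiter) is an assumption about the intended semantics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: computes the records a
// single pass over the whole input should yield for a given delimiter.
public class DelimiterBaseline {

    // Splits input on every exact occurrence of the delimiter, scanning
    // left to right. Any text after the last delimiter is emitted as a
    // final record (assumed semantics).
    public static List<String> expectedRecords(String input, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int hit;
        while ((hit = input.indexOf(delimiter, start)) >= 0) {
            records.add(input.substring(start, hit));
            start = hit + delimiter.length();
        }
        if (start < input.length()) {
            records.add(input.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        // Input/delimiter pairs taken from the issue description.
        String[][] cases = {
            {"abc++defghi+", "+++"},
            {"abc+def+ghi+", "+++"},
            {"abc++defghi+", "++"},
            {"abc++defghij+", "++"},
            {"abc+def+ghi+", "++"},
        };
        for (String[] c : cases) {
            System.out.println(c[0] + " | " + c[1] + " -> " + expectedRecords(c[0], c[1]));
        }
    }
}
```

Under these assumptions, "abc++defghi+" with delimiter "++" yields the two records "abc" and "defghi+", while the inputs that never contain the full delimiter yield a single record equal to the whole input. A split-size test would read the same input with each split size in turn and flag any run that returns more records than this baseline.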

        Attachments

        1. MAPREDUCE-6549.3.patch
          31 kB
          Wilfred Spiegelenburg
        2. MAPREDUCE-6549-1.patch
          3 kB
          Dustin Cote
        3. MAPREDUCE-6549-2.patch
          30 kB
          Wilfred Spiegelenburg

              People

              • Assignee:
                wilfreds Wilfred Spiegelenburg
                Reporter:
                cotedm Dustin Cote
              • Votes:
                0
                Watchers:
                11
