  1. Hadoop Map/Reduce
  2. MAPREDUCE-6549

multibyte delimiters with LineRecordReader cause duplicate records


Details

    • Reviewed

    Description

      LineRecordReader currently produces duplicate records for certain combinations of input, multibyte delimiter, and split size. Test scenarios:

      1) input string: "abc++defghi+"
      delimiter string: "+++"
      test passes with all sizes of the split
      2) input string: "abc+def+ghi+"
      delimiter string: "+++"
      test fails with a split size of 4
      3) input string: "abc++defghi+"
      delimiter string: "++"
      test fails with a split size of 5
      4) input string: "abc++defghij+"
      delimiter string: "++"
      test fails with a split size of 4
      5) input string: "abc+def+ghi+"
      delimiter string: "++"
      test fails with a split size of 9
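
      To make scenarios like these easier to check, here is a minimal, self-contained Java sketch. It does not use the Hadoop API: the class DelimiterSplitSketch and its methods expectedRecords and splitRanges are hypothetical helpers written for illustration. The first computes the records a correct reader should return for any split size (a plain left-to-right split on the delimiter); the second lists the byte ranges that a given split size produces, assuming fixed-size splits starting at offset 0 and single-byte characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop.
final class DelimiterSplitSketch {

    /** Records a correct reader should return, independent of split size. */
    static List<String> expectedRecords(String input, String delimiter) {
        List<String> records = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            int idx = input.indexOf(delimiter, start);
            if (idx < 0) {
                // No further delimiter: the remainder is the last record.
                records.add(input.substring(start));
                break;
            }
            records.add(input.substring(start, idx));
            start = idx + delimiter.length();
        }
        return records;
    }

    /** Half-open [start, end) ranges for fixed-size splits (assumed layout). */
    static List<int[]> splitRanges(int inputLength, int splitSize) {
        List<int[]> ranges = new ArrayList<>();
        for (int start = 0; start < inputLength; start += splitSize) {
            ranges.add(new int[] {start, Math.min(start + splitSize, inputLength)});
        }
        return ranges;
    }

    public static void main(String[] args) {
        String input = "abc++defghi+";
        String delimiter = "++";
        System.out.println("expected records: " + expectedRecords(input, delimiter));
        for (int[] r : splitRanges(input.length(), 5)) {
            System.out.println("split [" + r[0] + ", " + r[1] + "): "
                    + input.substring(r[0], r[1]));
        }
    }
}
```

      A test could concatenate the records read from every split and compare the result with expectedRecords; any extra entry would indicate a duplicate of the kind described above.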

      Attachments

        1. MAPREDUCE-6549.3.patch
          31 kB
          wilfreds#1
        2. MAPREDUCE-6549-2.patch
          30 kB
          wilfreds#1
        3. MAPREDUCE-6549-1.patch
          3 kB
          Dustin Cote


            People

              wilfreds Wilfred Spiegelenburg
              cotedm Dustin Cote
              Votes: 0
              Watchers: 10
