Hadoop Map/Reduce / MAPREDUCE-6549

multibyte delimiters with LineRecordReader cause duplicate records


Details

    • Reviewed

    Description

      LineRecordReader currently produces duplicate records under certain scenarios, such as:

      1) input string: "abc++defghi+"
      delimiter string: "+++"
      test passes with all sizes of the split
      2) input string: "abc+def+ghi+"
      delimiter string: "+++"
      test fails with a split size of 4
      3) input string: "abc++defghi+"
      delimiter string: "++"
      test fails with a split size of 5
      4) input string: "abc++defghij+"
      delimiter string: "++"
      test fails with a split size of 4
      5) input string: "abc+def+ghi+"
      delimiter string: "++"
      test fails with a split size of 9
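
To make the failing scenarios above easier to reason about, the following is a minimal sketch in Python (not Hadoop's Java code) that computes two things for each case: the record list one would expect when the input is read as a single piece, and the byte ranges each split covers. The helper names `expected_records` and `split_ranges` are hypothetical and exist only for this illustration. The sketch assumes that a trailing delimiter does not start another empty record and that splits are contiguous, fixed-size, half-open ranges. It does not reproduce the bug itself; it only shows what a split-aware reader should return in total, with each record appearing exactly once.

```python
def expected_records(data: str, delimiter: str) -> list[str]:
    """Reference result: split on the literal delimiter, ignoring input splits."""
    records = data.split(delimiter)
    # Assumption: a trailing delimiter does not start another (empty) record.
    if records and records[-1] == "":
        records.pop()
    return records


def split_ranges(length: int, split_size: int) -> list[tuple[int, int]]:
    """Half-open [start, end) ranges covering the input, one per split."""
    return [(start, min(start + split_size, length))
            for start in range(0, length, split_size)]


if __name__ == "__main__":
    # (input string, delimiter, split size) for the failing cases in the description.
    scenarios = [
        ("abc+def+ghi+", "+++", 4),
        ("abc++defghi+", "++", 5),
        ("abc++defghij+", "++", 4),
        ("abc+def+ghi+", "++", 9),
    ]
    for data, delimiter, split_size in scenarios:
        print(repr(data), repr(delimiter), "split size", split_size)
        print("  expected records:", expected_records(data, delimiter))
        print("  split ranges:    ", split_ranges(len(data), split_size))
```

A test for the reader could compare the concatenation of the records returned by every split against `expected_records`; any extra entry would be a duplicate of the kind this issue reports.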

      Attachments

        1. MAPREDUCE-6549-2.patch
          30 kB
          wilfreds#1
        2. MAPREDUCE-6549-1.patch
          3 kB
          Dustin Cote
        3. MAPREDUCE-6549.3.patch
          31 kB
          wilfreds#1

          People

            wilfreds Wilfred Spiegelenburg
            cotedm Dustin Cote