Hadoop Map/Reduce
MAPREDUCE-6558

multibyte delimiters with compressed input files generate duplicate records


Details

    Description

This is the follow-up for MAPREDUCE-6549. Compressed files cause record duplication, as shown in different JUnit tests. The number of duplicated records changes with the split size:

      Unexpected number of records in split (splitsize = 10)
      Expected: 41051
      Actual: 45062

      Unexpected number of records in split (splitsize = 100000)
      Expected: 41051
      Actual: 41052

The test passes with splitsize = 147445, which is the compressed file length. The file is a bzip2 file with 100k blocks and a total of 11 blocks.
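The duplication is tied to how record readers behave at split boundaries. The sketch below (an illustration, not Hadoop's actual LineRecordReader code) shows the per-split convention for an uncompressed byte stream with a multibyte delimiter: every split except the first skips past the first delimiter that ends after its start offset, and every split keeps reading past its end to finish the record it started. Under this convention the total record count is independent of the split size; the bug reported here is that with compressed input (where split offsets refer to the compressed stream) the equivalent bookkeeping goes wrong and records are emitted twice.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitReaderSketch {

    // Read the records belonging to the split [start, start + length).
    // Convention: a record belongs to the split containing its first byte
    // (a record starting exactly at a split boundary belongs to the
    // preceding split), so readers never emit duplicates.
    static List<String> readSplit(byte[] data, byte[] delim, int start, int length) {
        List<String> records = new ArrayList<>();
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Skip the partial record owned by an earlier split: find the
            // first delimiter occurrence that ends strictly after `start`.
            int i = indexOfDelim(data, delim, Math.max(0, start - delim.length + 1));
            if (i < 0) {
                return records; // the whole split lies inside an earlier record
            }
            pos = i + delim.length;
        }
        // Emit records that start at or before `end`; the last record of the
        // split may extend past `end` and is read to completion here.
        while (pos <= end && pos < data.length) {
            int next = indexOfDelim(data, delim, pos);
            if (next < 0) {
                // Final record of the file, with no trailing delimiter.
                records.add(new String(data, pos, data.length - pos));
                pos = data.length;
            } else {
                records.add(new String(data, pos, next - pos));
                pos = next + delim.length;
            }
        }
        return records;
    }

    // First occurrence of `delim` in `data` at index >= from, or -1.
    static int indexOfDelim(byte[] data, byte[] delim, int from) {
        outer:
        for (int i = Math.max(0, from); i <= data.length - delim.length; i++) {
            for (int j = 0; j < delim.length; j++) {
                if (data[i + j] != delim[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Four records separated by the multibyte delimiter "|+|".
        byte[] data = "aaa|+|bbb|+|ccc|+|ddd".getBytes();
        byte[] delim = "|+|".getBytes();
        // Regardless of the split size, the splits together yield 4 records,
        // even when a boundary falls in the middle of a delimiter.
        for (int splitSize : new int[] {1, 5, 10, 100}) {
            int total = 0;
            for (int off = 0; off < data.length; off += splitSize) {
                total += readSplit(data, delim, off,
                        Math.min(splitSize, data.length - off)).size();
            }
            System.out.println("splitsize=" + splitSize + " records=" + total);
        }
    }
}
```

This invariant is what the JUnit results above show being violated for bzip2 input: the expected count holds only when the split size equals the full compressed file length, i.e. when there is a single split and no boundary handling is exercised.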

      Attachments

        1. MAPREDUCE-6558.1.patch
          322 kB
          wilfreds#1
        2. MAPREDUCE-6558.2.patch
          16 kB
          wilfreds#1
        3. MAPREDUCE-6558.3.patch
          7 kB
          wilfreds#1



          People

              Assignee:
              wilfreds Wilfred Spiegelenburg
              Reporter:
              wilfreds Wilfred Spiegelenburg
              Votes:
              0
              Watchers:
              6

