Hadoop Map/Reduce / MAPREDUCE-6558

multibyte delimiters with compressed input files generate duplicate records

Details

    Description

      This is the follow-up to MAPREDUCE-6549. Compressed input files cause record duplication, as shown by several JUnit tests. The number of duplicated records changes with the split size:

      Unexpected number of records in split (splitsize = 10)
      Expected: 41051
      Actual: 45062

      Unexpected number of records in split (splitsize = 100000)
      Expected: 41051
      Actual: 41052

      The test passes with splitsize = 147445, which is the compressed file length. The file is a bzip2 file compressed with 100k blocks, 11 blocks in total.
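
The invariant behind these tests can be sketched without Hadoop: the records read across all splits must add up to the record count of the whole file, whatever the split size. The code below is a simplified model on an uncompressed in-memory byte array, not Hadoop's LineRecordReader and not the failing test itself. The class and method names are hypothetical, and it assumes a delimiter that cannot overlap itself (here "+-").

```java
import java.nio.charset.StandardCharsets;

public class SplitRecordCount {

    // Index of the first occurrence of delim in data at or after 'from', or -1.
    static int indexOf(byte[] data, byte[] delim, int from) {
        for (int i = Math.max(from, 0); i + delim.length <= data.length; i++) {
            int j = 0;
            while (j < delim.length && data[i + j] == delim[j]) {
                j++;
            }
            if (j == delim.length) {
                return i;
            }
        }
        return -1;
    }

    // Counts the records owned by split [start, end): a record whose first byte
    // is at offset p belongs to the split if start < p <= end. The record at
    // offset 0 belongs to the first split.
    static int countInSplit(byte[] data, byte[] delim, int start, int end) {
        int pos = 0;
        if (start > 0) {
            // Skip the record already in progress. The search begins early
            // enough to catch a delimiter that straddles the split start.
            int d = indexOf(data, delim, start - delim.length + 1);
            if (d < 0) {
                return 0;
            }
            pos = d + delim.length;
        }
        int count = 0;
        while (pos <= end && pos < data.length) {
            count++;
            int d = indexOf(data, delim, pos);
            pos = (d < 0) ? data.length : d + delim.length;
        }
        return count;
    }

    // Sums the per-split counts for fixed-size splits covering the whole array.
    static int countAcrossSplits(byte[] data, byte[] delim, int splitSize) {
        int total = 0;
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            total += countInSplit(data, delim, start, end);
        }
        return total;
    }

    // Builds n delimiter-terminated records: record0<delim>record1<delim>...
    static byte[] buildData(int n, String delim) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("record").append(i).append(delim);
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] delim = "+-".getBytes(StandardCharsets.UTF_8);
        byte[] data = buildData(50, "+-");
        int expected = countInSplit(data, delim, 0, data.length);
        for (int splitSize = 1; splitSize <= data.length; splitSize++) {
            int actual = countAcrossSplits(data, delim, splitSize);
            if (actual != expected) {
                System.out.println("Unexpected number of records (splitsize = "
                        + splitSize + "): expected " + expected + ", actual " + actual);
            }
        }
    }
}
```

In this model every record start offset falls in exactly one split, so the totals agree for every split size. The counts reported above show that this property does not hold for the compressed input.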

      Attachments

        1. MAPREDUCE-6558.3.patch
          7 kB
          wilfreds#1
        2. MAPREDUCE-6558.2.patch
          16 kB
          wilfreds#1
        3. MAPREDUCE-6558.1.patch
          322 kB
          wilfreds#1

            People

              wilfreds Wilfred Spiegelenburg
              Votes: 0
              Watchers: 6
