  1. Hadoop Map/Reduce
  2. MAPREDUCE-6558

multibyte delimiters with compressed input files generate duplicate records

    Details

      Description

      This is the follow-up to MAPREDUCE-6549. Compressed input files cause records to be duplicated, as several JUnit tests show. The number of duplicated records changes with the split size:

      Unexpected number of records in split (splitsize = 10)
      Expected: 41051
      Actual: 45062

      Unexpected number of records in split (splitsize = 100000)
      Expected: 41051
      Actual: 41052

      The test passes with splitsize = 147445, which is the compressed file length. The file is a bzip2 file compressed with 100k blocks, 11 blocks in total.
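
      The figures above can be put side by side with a short calculation. This is only a sketch of the arithmetic, not the failing test: it assumes splits are cut at fixed offsets of splitsize bytes with no slop factor, and the helper names num_splits and surplus are illustrative and do not come from Hadoop.

```python
# Numbers taken from the report above.
COMPRESSED_LENGTH = 147445   # bytes, length of the bzip2 test file
EXPECTED_RECORDS = 41051

# splitsize -> records actually read. The last entry is implied by
# "the test passes", i.e. actual equals expected.
REPORTED = {10: 45062, 100000: 41052, 147445: 41051}


def num_splits(file_length, split_size):
    """Number of fixed-size splits covering the file (assumption: no slop)."""
    return (file_length + split_size - 1) // split_size  # integer ceiling


def surplus(actual, expected=EXPECTED_RECORDS):
    """Records read beyond the expected count, i.e. the duplicates."""
    return actual - expected


if __name__ == "__main__":
    for split_size, actual in REPORTED.items():
        print(
            f"splitsize={split_size}: "
            f"{num_splits(COMPRESSED_LENGTH, split_size)} splits, "
            f"{surplus(actual)} surplus records"
        )
```

      Under that assumption, splitsize = 10 gives 14745 splits and 4011 surplus records, splitsize = 100000 gives 2 splits and 1 surplus record, and splitsize = 147445 gives a single split with no surplus. Duplicates therefore appear only when the file is read as more than one split.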

        Attachments

        1. MAPREDUCE-6558.3.patch
          7 kB
          Wilfred Spiegelenburg
        2. MAPREDUCE-6558.2.patch
          16 kB
          Wilfred Spiegelenburg
        3. MAPREDUCE-6558.1.patch
          322 kB
          Wilfred Spiegelenburg

              People

              • Assignee:
                wilfreds Wilfred Spiegelenburg
                Reporter:
                wilfreds Wilfred Spiegelenburg
              • Votes:
                0
                Watchers:
                7