  1. Hadoop Map/Reduce
  2. MAPREDUCE-6558

multibyte delimiters with compressed input files generate duplicate records

    Details

      Description

      This is the follow-up to MAPREDUCE-6549. With a multibyte record delimiter, compressed input files produce duplicate records, as several JUnit tests show. The number of duplicated records changes with the split size:

      Unexpected number of records in split (splitsize = 10)
      Expected: 41051
      Actual: 45062

      Unexpected number of records in split (splitsize = 100000)
      Expected: 41051
      Actual: 41052

      The test passes with splitsize = 147445, which is the compressed file length. The file is a bzip2 file with 100k blocks and a total of 11 blocks.
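      The "Expected" versus "Actual" checks above compare the number of records read across all splits with the number of records in the whole file. A minimal in-memory sketch of that invariant follows. It is not Hadoop's LineRecordReader and does not model bzip2 block boundaries; the class name SplitRecordCount, the delimiter "|+|", and the sample record format are assumptions made for illustration. Each record is assigned to the split that contains its first byte, so the per-split counts should add up to the same total for every split size.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of split ownership; not Hadoop's LineRecordReader.
class SplitRecordCount {

    // Builds n records "rec0", "rec1", ... each terminated by the delimiter.
    static byte[] sampleData(int n, byte[] delim) {
        StringBuilder sb = new StringBuilder();
        String d = new String(delim, StandardCharsets.UTF_8);
        for (int i = 0; i < n; i++) {
            sb.append("rec").append(i).append(d);
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // True if the full delimiter occurs in data starting at pos.
    static boolean matchesAt(byte[] data, byte[] delim, int pos) {
        for (int j = 0; j < delim.length; j++) {
            if (data[pos + j] != delim[j]) {
                return false;
            }
        }
        return true;
    }

    // Returns the offset of the first byte of every record in data.
    static List<Integer> recordStarts(byte[] data, byte[] delim) {
        if (delim.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<Integer> starts = new ArrayList<>();
        if (data.length == 0) {
            return starts;
        }
        starts.add(0);
        int i = 0;
        while (i + delim.length <= data.length) {
            if (matchesAt(data, delim, i)) {
                i += delim.length;
                // A delimiter at the very end does not open a new record.
                if (i < data.length) {
                    starts.add(i);
                }
            } else {
                i++;
            }
        }
        return starts;
    }

    // Counts the records whose first byte lies in [start, end).
    static int countInSplit(List<Integer> starts, int start, int end) {
        int count = 0;
        for (int s : starts) {
            if (s >= start && s < end) {
                count++;
            }
        }
        return count;
    }

    // Sums the per-split record counts for consecutive splits of splitSize bytes.
    static int totalAcrossSplits(byte[] data, byte[] delim, int splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Integer> starts = recordStarts(data, delim);
        int total = 0;
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            total += countInSplit(starts, start, end);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] delim = "|+|".getBytes(StandardCharsets.UTF_8);
        byte[] data = sampleData(1000, delim);
        int expected = recordStarts(data, delim).size();
        for (int splitSize : new int[] {10, 100, 1000, data.length}) {
            int actual = totalAcrossSplits(data, delim, splitSize);
            System.out.println("splitsize = " + splitSize
                + " expected = " + expected + " actual = " + actual);
        }
    }
}
```

      In this model a record is never counted twice because the split ranges do not overlap. A reader that deviates from such an ownership rule at a split boundary would show up as a mismatch between the two counts, which is the symptom reported here.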

      1. MAPREDUCE-6558.3.patch
        7 kB
        Wilfred Spiegelenburg
      2. MAPREDUCE-6558.2.patch
        16 kB
        Wilfred Spiegelenburg
      3. MAPREDUCE-6558.1.patch
        322 kB
        Wilfred Spiegelenburg

          Activity

          djp Junping Du added a comment -

          Hi, can we move this out of 2.6.3? Thanks!

          wilfreds Wilfred Spiegelenburg added a comment -

          I do not see a problem with that. I am still working on the fix for this.

          djp Junping Du added a comment -

          Thanks Wilfred Spiegelenburg, moving this to 2.6.4.

          djp Junping Du added a comment -

          Move it to 2.6.5 as 2.6.4 is almost there.

          Markovich Rivkin Andrey added a comment -

          Hi, I just ran into this problem.

          Multibyte delimiters with *.bz2 files.
          We are on Hadoop 2.5.0 and tried using the LineRecordReader from 2.6.4, which didn't help.
          Can you please suggest a workaround, or where to look to find out why this problem occurs?

          Regards,
          Andrey

          wilfreds Wilfred Spiegelenburg added a comment -

          Attached is a patch with test input that fails before the fix and passes after it.

          I have run all the tests that were there and they all still pass with this change. The tests have been run a large number of times with different input splits and none of them have failed.

          The use cases that passed before the fix was applied have been documented in the code.

          hadoopqa Hadoop QA added a comment -
          +1 overall



          | Vote | Subsystem | Runtime | Comment
          ============================================================================
          | 0 | reexec | 0m 13s | Docker mode activated.
          | +1 | @author | 0m 0s | The patch does not contain any @author tags.
          | +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files.
          | +1 | mvninstall | 6m 38s | trunk passed
          | +1 | compile | 0m 21s | trunk passed with JDK v1.8.0_91
          | +1 | compile | 0m 23s | trunk passed with JDK v1.7.0_95
          | +1 | checkstyle | 0m 17s | trunk passed
          | +1 | mvnsite | 0m 30s | trunk passed
          | +1 | mvneclipse | 0m 13s | trunk passed
          | +1 | findbugs | 1m 2s | trunk passed
          | +1 | javadoc | 0m 20s | trunk passed with JDK v1.8.0_91
          | +1 | javadoc | 0m 24s | trunk passed with JDK v1.7.0_95
          | +1 | mvninstall | 0m 24s | the patch passed
          | +1 | compile | 0m 18s | the patch passed with JDK v1.8.0_91
          | +1 | javac | 0m 18s | the patch passed
          | +1 | compile | 0m 22s | the patch passed with JDK v1.7.0_95
          | +1 | javac | 0m 22s | the patch passed
          | +1 | checkstyle | 0m 15s | the patch passed
          | +1 | mvnsite | 0m 27s | the patch passed
          | +1 | mvneclipse | 0m 11s | the patch passed
          | +1 | whitespace | 0m 0s | Patch has no whitespace issues.
          | +1 | findbugs | 1m 12s | the patch passed
          | +1 | javadoc | 0m 18s | the patch passed with JDK v1.8.0_91
          | +1 | javadoc | 0m 21s | the patch passed with JDK v1.7.0_95
          | +1 | unit | 2m 13s | hadoop-mapreduce-client-core in the patch passed with JDK v1.8.0_91.
          | +1 | unit | 2m 30s | hadoop-mapreduce-client-core in the patch passed with JDK v1.7.0_95.
          | +1 | asflicense | 0m 17s | Patch does not generate ASF License warnings.
          |    |  | 20m 7s |



          || Subsystem || Report/Notes ||
          ============================================================================
          | Docker | Image:yetus/hadoop:cf2ee45 |
          | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12803264/MAPREDUCE-6558.1.patch |
          | JIRA Issue | MAPREDUCE-6558 |
          | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
          | uname | Linux 7ef0b4d4af7b 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
          | Build tool | maven |
          | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
          | git revision | trunk / 0f0c641 |
          | Default Java | 1.7.0_95 |
          | Multi-JDK versions | /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 |
          | findbugs | v3.0.0 |
          | JDK v1.7.0_95 Test Results | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6491/testReport/ |
          | modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core |
          | Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6491/console |
          | Powered by | Apache Yetus 0.2.0 http://yetus.apache.org |

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Thanks, Wilfred Spiegelenburg! Patch looks good overall.

          I think we can significantly reduce the size of the testcase file since the problem occurs early in it. I noticed that if we cut the file down to just 530 records instead of 20,000 records and compress with bzip2 -1 it still catches the failure but is only a 10K binary rather than a 409K binary.

          jlowe Jason Lowe added a comment -

          Apparently Jenkins is having trouble posting to JIRA. The precommit build was an overall +1 for the .3 patch. From https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6495/console:

          +1 overall
          
          | Vote |      Subsystem |  Runtime   | Comment
          ============================================================================
          |   0  |        reexec  |  0m 13s    | Docker mode activated. 
          |  +1  |       @author  |  0m 0s     | The patch does not contain any @author 
          |      |                |            | tags.
          |  +1  |    test4tests  |  0m 0s     | The patch appears to include 3 new or 
          |      |                |            | modified test files.
          |  +1  |    mvninstall  |  6m 39s    | trunk passed 
          |  +1  |       compile  |  0m 20s    | trunk passed with JDK v1.8.0_91 
          |  +1  |       compile  |  0m 24s    | trunk passed with JDK v1.7.0_101 
          |  +1  |    checkstyle  |  0m 17s    | trunk passed 
          |  +1  |       mvnsite  |  0m 30s    | trunk passed 
          |  +1  |    mvneclipse  |  0m 13s    | trunk passed 
          |  +1  |      findbugs  |  1m 1s     | trunk passed 
          |  +1  |       javadoc  |  0m 20s    | trunk passed with JDK v1.8.0_91 
          |  +1  |       javadoc  |  0m 26s    | trunk passed with JDK v1.7.0_101 
          |  +1  |    mvninstall  |  0m 24s    | the patch passed 
          |  +1  |       compile  |  0m 18s    | the patch passed with JDK v1.8.0_91 
          |  +1  |         javac  |  0m 18s    | the patch passed 
          |  +1  |       compile  |  0m 22s    | the patch passed with JDK v1.7.0_101 
          |  +1  |         javac  |  0m 22s    | the patch passed 
          |  +1  |    checkstyle  |  0m 15s    | the patch passed 
          |  +1  |       mvnsite  |  0m 27s    | the patch passed 
          |  +1  |    mvneclipse  |  0m 11s    | the patch passed 
          |  +1  |    whitespace  |  0m 0s     | Patch has no whitespace issues. 
          |  +1  |      findbugs  |  1m 12s    | the patch passed 
          |  +1  |       javadoc  |  0m 18s    | the patch passed with JDK v1.8.0_91 
          |  +1  |       javadoc  |  0m 23s    | the patch passed with JDK v1.7.0_101 
          |  +1  |          unit  |  1m 53s    | hadoop-mapreduce-client-core in the 
          |      |                |            | patch passed with JDK v1.8.0_91.
          |  +1  |          unit  |  2m 15s    | hadoop-mapreduce-client-core in the 
          |      |                |            | patch passed with JDK v1.7.0_101.
          |  +1  |    asflicense  |  0m 18s    | Patch does not generate ASF License 
          |      |                |            | warnings.
          |      |                |  19m 38s   | 
          
          
          || Subsystem || Report/Notes ||
          ============================================================================
          | Docker |  Image:yetus/hadoop:cf2ee45 |
          | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12803799/MAPREDUCE-6558.3.patch |
          | JIRA Issue | MAPREDUCE-6558 |
          | Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  |
          | uname | Linux 2b492d4fc64c 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
          | Build tool | maven |
          | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
          | git revision | trunk / 3c5c57a |
          | Default Java | 1.7.0_101 |
          | Multi-JDK versions |  /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_101 |
          | findbugs | v3.0.0 |
          | JDK v1.7.0_101  Test Results | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6495/testReport/ |
          | modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core |
          | Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6495/console |
          | Powered by | Apache Yetus 0.2.0   http://yetus.apache.org |
          

          +1, patch looks good to me as well. Committing this.

          jlowe Jason Lowe added a comment -

          Thanks, Wilfred Spiegelenburg! I committed this to trunk, branch-2, branch-2.8, branch-2.7, and branch-2.6.

          wilfreds Wilfred Spiegelenburg added a comment -

          I somehow could not leave a comment yesterday. I made the .3 patch to fix some comments in the test code and decrease the size of the test file even further.
          Thank you for the review and the commit, Jason Lowe.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Closing the JIRA as part of 2.7.3 release.


            People

            • Assignee:
              wilfreds Wilfred Spiegelenburg
              Reporter:
              wilfreds Wilfred Spiegelenburg
            • Votes:
              0
              Watchers:
              7
