  1. Hadoop Map/Reduce
  2. MAPREDUCE-6633

AM should retry map attempts if the reduce task encounters compression related errors.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.2
    • Fix Version/s: 2.8.0, 2.7.3, 3.0.0-alpha1
    • Component/s: None
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      When a reduce task encounters compression-related errors, the AM doesn't retry the corresponding map task.
      In one of the cases we encountered, the stack trace was as follows.

      2016-01-27 13:44:28,915 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#29
      	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
      	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
      	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
      	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
      Caused by: java.lang.ArrayIndexOutOfBoundsException
      	at com.hadoop.compression.lzo.LzoDecompressor.setInput(LzoDecompressor.java:196)
      	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:104)
      	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
      	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
      	at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.shuffle(InMemoryMapOutput.java:97)
      	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:537)
      	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
      	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
      

      In this case, the node on which the map task ran had a bad drive.
      If the AM had retried running that map task somewhere else, the job definitely would have succeeded.

        Activity

        shahrs87 Rushabh S Shah added a comment -

        In the Fetcher#copyMapOutput method, I added Exception to the catch block so that it will retry on any compression-related exception.

          try {
            // Go!
            LOG.info("fetcher#" + id + " about to shuffle output of map "
                + mapOutput.getMapId() + " decomp: " + decompressedLength
                + " len: " + compressedLength + " to " + mapOutput.getDescription());
            mapOutput.shuffle(host, is, compressedLength, decompressedLength,
                metrics, reporter);
          } catch (java.lang.InternalError e) {
            LOG.warn("Failed to shuffle for fetcher#" + id, e);
            throw new IOException(e);
          }
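
        The snippet above lists only InternalError in its catch clause. As a minimal, self-contained sketch of what widening that clause to Exception could look like, the example below uses a Java multi-catch to wrap any failure from a shuffle step in an IOException, which is the exception type the snippet above already rethrows. The class ShuffleCatchSketch, the ShuffleStep interface, and the method shuffleOrThrowIOException are hypothetical stand-ins for illustration; they are not Hadoop APIs, and this is not the committed patch.

```java
import java.io.IOException;

public class ShuffleCatchSketch {

    /** Hypothetical stand-in for the mapOutput.shuffle(...) call; not a Hadoop type. */
    public interface ShuffleStep {
        void run() throws IOException;
    }

    /**
     * Runs the step and wraps an InternalError or any Exception in an IOException,
     * so that the caller sees one failure type regardless of what the codec threw.
     */
    public static void shuffleOrThrowIOException(ShuffleStep step) throws IOException {
        try {
            step.run();
        } catch (InternalError | Exception e) {
            // Assumption: the caller treats IOException as a failed fetch.
            throw new IOException(e);
        }
    }

    public static void main(String[] args) {
        try {
            // Simulate a decompressor that fails with an unchecked exception.
            shuffleOrThrowIOException(() -> {
                throw new ArrayIndexOutOfBoundsException("simulated corrupt block");
            });
        } catch (IOException e) {
            System.out.println("wrapped cause: " + e.getCause().getClass().getName());
        }
    }
}
```

        Note that with this shape an IOException thrown by the step itself is also wrapped in a second IOException; whether that matters depends on how the caller inspects the cause.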
        
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote  Subsystem   Runtime  Comment
        0     reexec      0m 17s   Docker mode activated.
        +1    @author     0m 0s    The patch does not contain any @author tags.
        +1    test4tests  0m 0s    The patch appears to include 1 new or modified test files.
        +1    mvninstall  7m 36s   trunk passed
        +1    compile     0m 25s   trunk passed with JDK v1.8.0_74
        +1    compile     0m 27s   trunk passed with JDK v1.7.0_95
        +1    checkstyle  0m 18s   trunk passed
        +1    mvnsite     0m 39s   trunk passed
        +1    mvneclipse  0m 15s   trunk passed
        +1    findbugs    1m 16s   trunk passed
        +1    javadoc     0m 29s   trunk passed with JDK v1.8.0_74
        +1    javadoc     0m 31s   trunk passed with JDK v1.7.0_95
        +1    mvninstall  0m 29s   the patch passed
        +1    compile     0m 27s   the patch passed with JDK v1.8.0_74
        +1    javac       0m 27s   the patch passed
        +1    compile     0m 25s   the patch passed with JDK v1.7.0_95
        +1    javac       0m 25s   the patch passed
        +1    checkstyle  0m 16s   the patch passed
        +1    mvnsite     0m 33s   the patch passed
        +1    mvneclipse  0m 14s   the patch passed
        +1    whitespace  0m 0s    Patch has no whitespace issues.
        +1    findbugs    1m 25s   the patch passed
        +1    javadoc     0m 27s   the patch passed with JDK v1.8.0_74
        +1    javadoc     0m 27s   the patch passed with JDK v1.7.0_95
        +1    unit        2m 20s   hadoop-mapreduce-client-core in the patch passed with JDK v1.8.0_74.
        -1    unit        2m 37s   hadoop-mapreduce-client-core in the patch failed with JDK v1.7.0_95.
        +1    asflicense  0m 20s   Patch does not generate ASF License warnings.
                          23m 20s



        Reason                            Tests
        JDK v1.7.0_95 Failed junit tests  hadoop.mapreduce.tools.TestCLI



        Subsystem Report/Notes
        Docker Image: yetus/hadoop:fbe3e86
        JIRA Patch URL: https://issues.apache.org/jira/secure/attachment/12795454/MAPREDUCE-6633.patch
        JIRA Issue: MAPREDUCE-6633
        Optional Tests: asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname: Linux 8e3b6fa2658d 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool: maven
        Personality: /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision: trunk / e8fc81f
        Default Java: 1.7.0_95
        Multi-JDK versions: /usr/lib/jvm/java-8-oracle:1.8.0_74 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs: v3.0.0
        unit: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6398/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core-jdk1.7.0_95.txt
        unit test logs: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6398/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core-jdk1.7.0_95.txt
        JDK v1.7.0_95 Test Results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6398/testReport/
        modules: C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6398/console
        Powered by: Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        eepayne Eric Payne added a comment -

        Thanks Rushabh S Shah for reporting this issue and providing a patch.

        Overall, the patch looks good. I am a little nervous about re-fetching for any exception. If there is a runtime exception on the reducer (memory error, NPE, etc.), maps would be re-run unnecessarily. That said, I understand that the risk is low and that, in any case, no data would be lost, only some time and resources. What are your thoughts?

        shahrs87 Rushabh S Shah added a comment -

        "If there is a runtime exception on the reducer (memory error, NPE, etc.), maps would be re-run unnecessarily."

        In this case the decompressor threw a RuntimeException (ArrayIndexOutOfBoundsException is a subclass).
        If we had rerun the map on another node, the job would have succeeded.

        "I am a little nervous about re-fetching for any exception."

        I understand your concern, but I think it is a good change.

        shahrs87 Rushabh S Shah added a comment -

        Ran the failed JUnit test on both JDK 7 and JDK 8.
        It passed on both on my machine.

        Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.54 sec <<< FAILURE! - in org.apache.hadoop.mapreduce.tools.TestCLI
        testGetJob(org.apache.hadoop.mapreduce.tools.TestCLI)  Time elapsed: 0.084 sec  <<< FAILURE!
        java.lang.AssertionError: null
        	at org.junit.Assert.fail(Assert.java:86)
        	at org.junit.Assert.assertTrue(Assert.java:41)
        	at org.junit.Assert.assertTrue(Assert.java:52)
        	at org.apache.hadoop.mapreduce.tools.TestCLI.testGetJob(TestCLI.java:181)
        
        eepayne Eric Payne added a comment -

        "In this case the decompressor threw a RuntimeException (ArrayIndexOutOfBoundsException is a subclass).
        If we had rerun the map on another node, the job would have succeeded.
        ...
        I understand your concern, but I think it is a good change."

        Thanks Rushabh S Shah. It would be ideal to come up with a subset that would cover only the exceptions that could be thrown, but I agree that the change is fine as it is.
        +1
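
        One way to picture the narrower "subset" mentioned above is a filter that treats an unchecked exception as a fetch failure only when a codec class appears in its stack trace. The sketch below is purely illustrative and was not part of the committed change: the class DecompressionFailureFilter and its methods are hypothetical, and the package prefixes are simply the ones visible in the stack trace in this issue's description.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class DecompressionFailureFilter {

    // Assumption: prefixes copied from the stack trace in the description;
    // a real deployment could use other codecs and would need its own list.
    private static final List<String> CODEC_PREFIXES = Arrays.asList(
        "org.apache.hadoop.io.compress.",
        "com.hadoop.compression.");

    /** Returns true if any frame of the throwable's stack trace belongs to a codec package. */
    public static boolean thrownByCodec(Throwable t) {
        for (StackTraceElement frame : t.getStackTrace()) {
            for (String prefix : CODEC_PREFIXES) {
                if (frame.getClassName().startsWith(prefix)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Wraps codec failures in IOException; rethrows every other unchecked exception unchanged. */
    public static void rethrowForRetry(RuntimeException e) throws IOException {
        if (thrownByCodec(e)) {
            throw new IOException(e);
        }
        throw e;
    }

    public static void main(String[] args) {
        // Build an exception whose stack trace mimics the one in the description.
        RuntimeException codecFailure = new ArrayIndexOutOfBoundsException("simulated");
        codecFailure.setStackTrace(new StackTraceElement[] {
            new StackTraceElement("com.hadoop.compression.lzo.LzoDecompressor",
                "setInput", "LzoDecompressor.java", 196)
        });
        System.out.println("codec failure detected: " + thrownByCodec(codecFailure));
    }
}
```

        The trade-off is the one raised in the review: a filter like this avoids re-running maps for unrelated reducer bugs, but it can miss corruption that surfaces outside the listed packages, which is an argument for the broader catch that was accepted.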

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #9586 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9586/)
        MAPREDUCE-6633. AM should retry map attempts if the reduce task (epayne: rev 1fec06e037d2b22dafc64f33d4f1231bef4ceba8)

        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
        shahrs87 Rushabh S Shah added a comment -

        Eric Payne: Thanks for the reviews and for committing.
        Does it make sense to fix this in the 2.7 branch as well?

        eepayne Eric Payne added a comment -

        Thanks Rushabh S Shah. I cherry-picked this back to 2.7.

        vinodkv Vinod Kumar Vavilapalli added a comment -

        Closing the JIRA as part of 2.7.3 release.


          People

          • Assignee: shahrs87 Rushabh S Shah
          • Reporter: shahrs87 Rushabh S Shah
          • Votes: 0
          • Watchers: 8
