Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.204.0, 0.23.0
    • Fix Version/s: 0.20.204.0, 0.23.0
    • Component/s: tasktracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hide
      Added 2 new config parameters:

      mapreduce.reduce.shuffle.catch.exception.stack.regex
      mapreduce.reduce.shuffle.catch.exception.message.regex
      Show
      Added 2 new config parameters: mapreduce.reduce.shuffle.catch.exception.stack.regex mapreduce.reduce.shuffle.catch.exception.message.regex

      Description

      We are seeing many instances of the Jetty-1342 (http://jira.codehaus.org/browse/JETTY-1342). The bug doesn't cause Jetty to stop responding altogether, some fetches go through but a lot of them throw exceptions and eventually fail. The only way we have found to get the TT out of this state is to restart the TT. This jira is to catch this particular exception (or perhaps a configurable regex) and handle it in an automated way to either blacklist or shutdown the TT after seeing it a configurable number of them.

      1. M2529-1.patch
        16 kB
        Chris Douglas
      2. M2529-1-20s.patch
        14 kB
        Chris Douglas
      3. mapred2529-trunk.patch
        18 kB
        Thomas Graves
      4. jetty1342-20security.patch
        16 kB
        Thomas Graves

        Activity

        Hide
        Owen O'Malley added a comment -

        Hadoop 0.20.204.0 was just released.

        Show
        Owen O'Malley added a comment - Hadoop 0.20.204.0 was just released.
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #722 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/722/)

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #722 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/722/ )
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #714 (See https://builds.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/714/)
        MAPREDUCE-2529. Add support for regex-based shuffle metric counting
        exceptions. Contributed by Thomas Graves

        cdouglas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1131736
        Files :

        • /hadoop/mapreduce/trunk/CHANGES.txt
        • /hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapred/TaskTracker.java
        • /hadoop/mapreduce/trunk/src/test/mapred/org/apache/hadoop/mapred/TestShuffleExceptionCount.java
        • /hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/server/jobtracker/JTConfig.java
        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #714 (See https://builds.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/714/ ) MAPREDUCE-2529 . Add support for regex-based shuffle metric counting exceptions. Contributed by Thomas Graves cdouglas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1131736 Files : /hadoop/mapreduce/trunk/CHANGES.txt /hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapred/TaskTracker.java /hadoop/mapreduce/trunk/src/test/mapred/org/apache/hadoop/mapred/TestShuffleExceptionCount.java /hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/server/jobtracker/JTConfig.java
        Hide
        Chris Douglas added a comment -

        +1

        I committed this. Thanks, Tom!

        Show
        Chris Douglas added a comment - +1 I committed this. Thanks, Tom!
        Hide
        Thomas Graves added a comment -

        Patches look good. Thanks for making the changes.

        Show
        Thomas Graves added a comment - Patches look good. Thanks for making the changes.
        Hide
        Chris Douglas added a comment -

        If the currently used version of Jetty is buggy then why can't we revert back to the previous stable (well tested) release? Apologies if its already discussed.

        The last version had a XSS attack, performance issues, and a memory leak. The XSS attack was what finally compelled the upgrade.

        Show
        Chris Douglas added a comment - If the currently used version of Jetty is buggy then why can't we revert back to the previous stable (well tested) release? Apologies if its already discussed. The last version had a XSS attack, performance issues, and a memory leak. The XSS attack was what finally compelled the upgrade.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12481433/M2529-1.patch
        against trunk revision 1131265.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.cli.TestMRCLI

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/346//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/346//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/346//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12481433/M2529-1.patch against trunk revision 1131265. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.cli.TestMRCLI +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/346//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/346//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/346//console This message is automatically generated.
        Hide
        Amar Kamat added a comment -

        If the currently used version of Jetty is buggy then why can't we revert back to the previous stable (well tested) release? Apologies if its already discussed.

        Show
        Amar Kamat added a comment - If the currently used version of Jetty is buggy then why can't we revert back to the previous stable (well tested) release? Apologies if its already discussed.
        Hide
        Chris Douglas added a comment -

        Minor nits:

        • As a default, always incrementing the metric for undefined regex probably makes more sense
        • null is probably a better default than the empty string
        • There's a possible NPE if the exception message is null
        • The unit test is setting combinations of the stack/message regex, but it calls checkStackException in a few places, which doesn't exercise that logic (I think it's covered, but that could be clearer)
        • While this will be useful while we work around bugs emerging from Jetty, we should probably keep it as an undocumented config setting.
        • The trunk patch updates MRJobConfig, which is for user jobs. Moved to JTConfig

        This slight modification defines exceptions with null messages as matching no regexp. Let me know if it looks OK to you

        Show
        Chris Douglas added a comment - Minor nits: As a default, always incrementing the metric for undefined regex probably makes more sense null is probably a better default than the empty string There's a possible NPE if the exception message is null The unit test is setting combinations of the stack/message regex, but it calls checkStackException in a few places, which doesn't exercise that logic (I think it's covered, but that could be clearer) While this will be useful while we work around bugs emerging from Jetty, we should probably keep it as an undocumented config setting. The trunk patch updates MRJobConfig , which is for user jobs. Moved to JTConfig This slight modification defines exceptions with null messages as matching no regexp. Let me know if it looks OK to you
        Hide
        Thomas Graves added a comment -

        cancelling to get hudson to run again.

        Show
        Thomas Graves added a comment - cancelling to get hudson to run again.
        Hide
        Thomas Graves added a comment -

        findbugs warning was there before my change and test failure (org.apache.hadoop.raid.TestBlockFixerParityBlockFixDist.testGeneratedLastBlockLocal) again isn't from this change. I ran it in my environment and it worked:

        test-junit:
        [junit] WARNING: multiple versions of ant detected in path for junit
        [junit] jar:file:/home/tgraves/hadoop/apache-ant-1.8.2/lib/ant.jar!/org/apache/tools/ant/Project.class
        [junit] and jar:file:/home/tgraves/.ivy2/cache/ant/ant/jars/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
        [junit] Running org.apache.hadoop.raid.TestBlockFixerParityBlockFixDist

        [junit] Tests run: 9, Failures: 0, Errors: 0, Time elapsed: 257.206 sec

        Show
        Thomas Graves added a comment - findbugs warning was there before my change and test failure (org.apache.hadoop.raid.TestBlockFixerParityBlockFixDist.testGeneratedLastBlockLocal) again isn't from this change. I ran it in my environment and it worked: test-junit: [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar: file:/home/tgraves/hadoop/apache-ant-1.8.2/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar: file:/home/tgraves/.ivy2/cache/ant/ant/jars/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.hadoop.raid.TestBlockFixerParityBlockFixDist [junit] Tests run: 9, Failures: 0, Errors: 0, Time elapsed: 257.206 sec
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12480969/mapred2529-trunk.patch
        against trunk revision 1129771.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/322//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/322//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/322//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12480969/mapred2529-trunk.patch against trunk revision 1129771. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/322//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/322//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/322//console This message is automatically generated.
        Hide
        Thomas Graves added a comment -

        patch for trunk.

        Show
        Thomas Graves added a comment - patch for trunk.
        Hide
        Liyin Liang added a comment -

        After upgrading to jetty 6.1.26, our product cluster met the same problem. Through observation, we found TT will throw lots of "java.io.IOException: Broken pipe" when serve map-output and Jetty print logs as follows in this case.

        2011-05-30 00:11:06,389 INFO org.mortbay.log: org.mortbay.io.nio.SelectorManager$SelectSet@6cf3a37f Busy selector - injecting delay 3 times

        So we just grep "Busy selector" from TT's log to detect this bug.

        Show
        Liyin Liang added a comment - After upgrading to jetty 6.1.26, our product cluster met the same problem. Through observation, we found TT will throw lots of "java.io.IOException: Broken pipe" when serve map-output and Jetty print logs as follows in this case. 2011-05-30 00:11:06,389 INFO org.mortbay.log: org.mortbay.io.nio.SelectorManager$SelectSet@6cf3a37f Busy selector - injecting delay 3 times So we just grep "Busy selector" from TT's log to detect this bug.
        Hide
        Thomas Graves added a comment -

        patch for branch-0.20-security

        Show
        Thomas Graves added a comment - patch for branch-0.20-security
        Hide
        Thomas Graves added a comment -

        I'm proposing to add a new metric to the shuffle output metrics and increment it when it sees a configurable regex in the IOexception in the MapOutputServlet. This metric can then be viewed by external systems or potentially the health_check script (HADOOP-7144 should make that easier). Making it configurable will make it more useful in the future in case we see other Jetty/JVM exceptions/issues that need to be worked around.

        Show
        Thomas Graves added a comment - I'm proposing to add a new metric to the shuffle output metrics and increment it when it sees a configurable regex in the IOexception in the MapOutputServlet. This metric can then be viewed by external systems or potentially the health_check script ( HADOOP-7144 should make that easier). Making it configurable will make it more useful in the future in case we see other Jetty/JVM exceptions/issues that need to be worked around.

          People

          • Assignee:
            Thomas Graves
            Reporter:
            Thomas Graves
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development