Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-3851

Allow more aggressive action on detection of the jetty issue

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.2
    • Component/s: tasktracker
    • Labels:
      None
    • Target Version/s:
    • Release Note:
      Hide
      added new configuration variables to control when TT aborts if it sees a certain number of exceptions:

          // Percent of shuffle exceptions (out of sample size) seen before it's
          // fatal - acceptable values are from 0 to 1.0, 0 disables the check.
          // ie. 0.3 = 30% of the last X number of requests matched the exception,
          // so abort.
            conf.getFloat(
                "mapreduce.reduce.shuffle.catch.exception.percent.limit.fatal", 0);

          // The number of trailing requests we track, used for the fatal
          // limit calculation
            conf.getInt("mapreduce.reduce.shuffle.catch.exception.sample.size", 1000);
      Show
      added new configuration variables to control when TT aborts if it sees a certain number of exceptions:     // Percent of shuffle exceptions (out of sample size) seen before it's     // fatal - acceptable values are from 0 to 1.0, 0 disables the check.     // ie. 0.3 = 30% of the last X number of requests matched the exception,     // so abort.       conf.getFloat(           "mapreduce.reduce.shuffle.catch.exception.percent.limit.fatal", 0);     // The number of trailing requests we track, used for the fatal     // limit calculation       conf.getInt("mapreduce.reduce.shuffle.catch.exception.sample.size", 1000);

      Description

      MAPREDUCE-2529 added the useful failure detection mechanism. In this jira, I propose we add a periodic check inside TT and configurable action to self-destruct. Blacklisting helps but is not enough. Hung jetty still accepts connection and it takes very long time for clients to fail out. Short jobs are delayed for hours because of this. This feature will be a nice companion to MAPREDUCE-3184.

      1. MAPREDUCE-3851.patch
        11 kB
        Thomas Graves
      2. MAPREDUCE-3851.patch
        22 kB
        Thomas Graves
      3. MAPREDUCE-3851.patch
        22 kB
        Thomas Graves
      4. MAPREDUCE-3851.patch
        22 kB
        Thomas Graves

        Activity

        Hide
        Thomas Graves added a comment -

        Thanks Todd, great to hear!

        Show
        Thomas Graves added a comment - Thanks Todd, great to hear!
        Hide
        Todd Lipcon added a comment -

        Thomas Graves, I finally figured out the bug that causes the acceptors to occasionally not start. It's JETTY-1316. I pushed a patched jetty here: https://github.com/toddlipcon/jetty-hadoop-fix/

        and we'll try to deploy it to our maven repository as 6.1.26.cloudera.2 soon.

        Hope this helps!

        Show
        Todd Lipcon added a comment - Thomas Graves , I finally figured out the bug that causes the acceptors to occasionally not start. It's JETTY-1316. I pushed a patched jetty here: https://github.com/toddlipcon/jetty-hadoop-fix/ and we'll try to deploy it to our maven repository as 6.1.26.cloudera.2 soon. Hope this helps!
        Hide
        Matt Foley added a comment -

        Corrected Target and Fixed versions.

        Show
        Matt Foley added a comment - Corrected Target and Fixed versions.
        Hide
        Robert Joseph Evans added a comment -

        Thanks Tom. I just put this into branch-1 and branch-1.0

        Show
        Robert Joseph Evans added a comment - Thanks Tom. I just put this into branch-1 and branch-1.0
        Hide
        Robert Joseph Evans added a comment -

        Sounds good to me +1

        Show
        Robert Joseph Evans added a comment - Sounds good to me +1
        Hide
        Thomas Graves added a comment -

        in the original jetty detection jira - MAPREDUCE-2529 - Chris had asked to keep it undocumented see https://issues.apache.org/jira/browse/MAPREDUCE-2529?focusedCommentId=13044180&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13044180

        It appears that none of the configuration similar to this are documented so kept with that theme.

        Show
        Thomas Graves added a comment - in the original jetty detection jira - MAPREDUCE-2529 - Chris had asked to keep it undocumented see https://issues.apache.org/jira/browse/MAPREDUCE-2529?focusedCommentId=13044180&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13044180 It appears that none of the configuration similar to this are documented so kept with that theme.
        Hide
        Robert Joseph Evans added a comment -

        The patch looks good to me. The only thing that I can think of might be documentation. Do we want to have this documented? It will be removed in 0.23 when we no longer use Jetty so that makes me lean towards no, but the default config value is to have this disabled "mapreduce.reduce.shuffle.catch.exception.percent.limit.fatal" is 0.0 by default. which if it is something that is off by default users who have this problem will need to know how they can turn this on.

        Show
        Robert Joseph Evans added a comment - The patch looks good to me. The only thing that I can think of might be documentation. Do we want to have this documented? It will be removed in 0.23 when we no longer use Jetty so that makes me lean towards no, but the default config value is to have this disabled "mapreduce.reduce.shuffle.catch.exception.percent.limit.fatal" is 0.0 by default. which if it is something that is off by default users who have this problem will need to know how they can turn this on.
        Hide
        Thomas Graves added a comment -

        testpatch output - again the 10 findbugs aren't from this patch.

        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 6 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] -1 findbugs. The patch appears to introduce 10 new Findbugs (version 1.3.9) warnings.

        Show
        Thomas Graves added a comment - testpatch output - again the 10 findbugs aren't from this patch. [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] -1 findbugs. The patch appears to introduce 10 new Findbugs (version 1.3.9) warnings.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12515608/MAPREDUCE-3851.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1910//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12515608/MAPREDUCE-3851.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1910//console This message is automatically generated.
        Hide
        Thomas Graves added a comment -

        Thanks for the review Kihwal. I went ahead and updated it to make the check for size < 0 explicit.

        Show
        Thomas Graves added a comment - Thanks for the review Kihwal. I went ahead and updated it to make the check for size < 0 explicit.
        Hide
        Kihwal Lee added a comment -

        +1 The patch looks good to me. There is no explicit check and error message against size <= 0, but I guess it's okay since TT won't go any further and the stack trace should give a clue.

        Show
        Kihwal Lee added a comment - +1 The patch looks good to me. There is no explicit check and error message against size <= 0 , but I guess it's okay since TT won't go any further and the stack trace should give a clue.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12515092/MAPREDUCE-3851.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1897//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12515092/MAPREDUCE-3851.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1897//console This message is automatically generated.
        Hide
        Thomas Graves added a comment -

        exclude findbugs warning. test-patch output now matches branch-1.0 without this patch.

        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 6 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] -1 findbugs. The patch appears to introduce 10 new Findbugs (version 1.3.9) warnings.

        Show
        Thomas Graves added a comment - exclude findbugs warning. test-patch output now matches branch-1.0 without this patch. [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] -1 findbugs. The patch appears to introduce 10 new Findbugs (version 1.3.9) warnings.
        Hide
        Thomas Graves added a comment -

        Without this change there are 10 findbugs warnings. This one introduces:
        org.apache.hadoop.mapred.ShuffleExceptionTracker.doAbort() invokes System.exit(...), which shuts down the entire virtual machine

        This is expected, I will update patch to exclude this.

        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] -1 findbugs. The patch appears to introduce 11 new Findbugs (version 1.3.9) warnings. [exec]

        Show
        Thomas Graves added a comment - Without this change there are 10 findbugs warnings. This one introduces: org.apache.hadoop.mapred.ShuffleExceptionTracker.doAbort() invokes System.exit(...), which shuts down the entire virtual machine This is expected, I will update patch to exclude this. [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] -1 findbugs. The patch appears to introduce 11 new Findbugs (version 1.3.9) warnings. [exec]
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12515044/MAPREDUCE-3851.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1892//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12515044/MAPREDUCE-3851.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1892//console This message is automatically generated.
        Hide
        Thomas Graves added a comment -

        patch now has the limit set as a % of the last X number requests. Pulled things out into their own class.

        Show
        Thomas Graves added a comment - patch now has the limit set as a % of the last X number requests. Pulled things out into their own class.
        Hide
        Todd Lipcon added a comment -

        Ratio over the last N requests sound good. I was worried about false triggering - eg if you get one random exception once every 100000 calls, but your TT is up for months, it will eventually crash with the other approach.

        Show
        Todd Lipcon added a comment - Ratio over the last N requests sound good. I was worried about false triggering - eg if you get one random exception once every 100000 calls, but your TT is up for months, it will eventually crash with the other approach.
        Hide
        Thomas Graves added a comment -

        All the instances I've seen where it hits JETTY-1342 it is hosed pretty much immediately and it doesn't recover after seeing just a couple of the exceptions. That is why I had it just a straight count over the lifetime.

        I guess it would make it more extensible for possible future Jetty bugs to have it be a ratio though. So how about we do the ratio of say the last 100 requests? If that ratio goes above a certain threshold, then we abort. we could make that number (100) configurable if someone really wants.

        Show
        Thomas Graves added a comment - All the instances I've seen where it hits JETTY-1342 it is hosed pretty much immediately and it doesn't recover after seeing just a couple of the exceptions. That is why I had it just a straight count over the lifetime. I guess it would make it more extensible for possible future Jetty bugs to have it be a ratio though. So how about we do the ratio of say the last 100 requests? If that ratio goes above a certain threshold, then we abort. we could make that number (100) configurable if someone really wants.
        Hide
        Todd Lipcon added a comment -

        Looks like the exception count is across the lifetime of the TT. It should instead be something like a trailing time window, or better, a trailing ratio of successes vs exceptions. Otherwise there's no reasonable way to set a value for the limit. right?

        Show
        Todd Lipcon added a comment - Looks like the exception count is across the lifetime of the TT. It should instead be something like a trailing time window, or better, a trailing ratio of successes vs exceptions. Otherwise there's no reasonable way to set a value for the limit. right?
        Hide
        Thomas Graves added a comment -

        Yes we've seen both instances in production. I don't have any metrics on which one we are seeing more. I should have a patch for this up later today, finishing up some testing.

        Show
        Thomas Graves added a comment - Yes we've seen both instances in production. I don't have any metrics on which one we are seeing more. I should have a patch for this up later today, finishing up some testing.
        Hide
        Todd Lipcon added a comment -

        Gotcha - you're correct that 3184 doesn't handle the broken acceptor issue. Have you seen this a lot in production? We run into JETTY-937 a lot, but haven't seen the broken acceptors in 6.1.26.

        Show
        Todd Lipcon added a comment - Gotcha - you're correct that 3184 doesn't handle the broken acceptor issue. Have you seen this a lot in production? We run into JETTY-937 a lot, but haven't seen the broken acceptors in 6.1.26.
        Hide
        Thomas Graves added a comment -

        We are looking at doing this to better handle http://jira.codehaus.org/browse/JETTY-1342. This is supposedly fixed in Jetty 6.1.27 - if it ever gets released, which I'm guessing is also in the patched 6.1.26 that Todd created.

        I believe MAPREDUCE-3184 is for http://jira.codehaus.org/browse/JETTY-937.

        We are seeing both and would like to more aggressive handle Jetty-1342 by exiting immediately so any running tasks trying to fetch data from this TT will get relaunched somewhere else rather then trying a bunch of times.

        Todd, as Kihwal mentioned if you think MAPREDUCE-3184 also covers Jetty-1342 please let us know.

        Show
        Thomas Graves added a comment - We are looking at doing this to better handle http://jira.codehaus.org/browse/JETTY-1342 . This is supposedly fixed in Jetty 6.1.27 - if it ever gets released, which I'm guessing is also in the patched 6.1.26 that Todd created. I believe MAPREDUCE-3184 is for http://jira.codehaus.org/browse/JETTY-937 . We are seeing both and would like to more aggressive handle Jetty-1342 by exiting immediately so any running tasks trying to fetch data from this TT will get relaunched somewhere else rather then trying a bunch of times. Todd, as Kihwal mentioned if you think MAPREDUCE-3184 also covers Jetty-1342 please let us know.
        Hide
        Kihwal Lee added a comment -

        @Todd If anyone can provide evidence or even a stat showing that the jetty exception behavior converges to spinning selector thread behavior in a short amount of time, I see no reason to have this jira.

        I've seen cases where some trackers having quite a few of the exceptions reported in MAPREDUCE-2529 while many new requests were served successfully. The fault-detection coverage of 100% is very critical to this problem. If it's not 100%, jobs with a large number of mappers will still suffer. Other than the potential overlap of coverage, do you see any risk in this? We want the threshold to be configurable and the default can be either to 0 (disable) or very high.

        Show
        Kihwal Lee added a comment - @Todd If anyone can provide evidence or even a stat showing that the jetty exception behavior converges to spinning selector thread behavior in a short amount of time, I see no reason to have this jira. I've seen cases where some trackers having quite a few of the exceptions reported in MAPREDUCE-2529 while many new requests were served successfully. The fault-detection coverage of 100% is very critical to this problem. If it's not 100%, jobs with a large number of mappers will still suffer. Other than the potential overlap of coverage, do you see any risk in this? We want the threshold to be configurable and the default can be either to 0 (disable) or very high.
        Hide
        Kihwal Lee added a comment -

        I was under the impression that the jetty health check in MAPREDUCE-2529 and the spinning jetty detection don't have the exact coverge. If MAPREDUCE-3184 alone can cover 100%, There is no reason to have this jira. In that case I will ask it to be included in 1.0.1.

        Show
        Kihwal Lee added a comment - I was under the impression that the jetty health check in MAPREDUCE-2529 and the spinning jetty detection don't have the exact coverge. If MAPREDUCE-3184 alone can cover 100%, There is no reason to have this jira. In that case I will ask it to be included in 1.0.1.
        Hide
        Todd Lipcon added a comment -

        Hey Kihwal. What's the distinction between this JIRA and MAPREDUCE-3184? Just making it configurable rather than doing a System.exit?

        Show
        Todd Lipcon added a comment - Hey Kihwal. What's the distinction between this JIRA and MAPREDUCE-3184 ? Just making it configurable rather than doing a System.exit?

          People

          • Assignee:
            Thomas Graves
            Reporter:
            Kihwal Lee
          • Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development