Hadoop Map/Reduce
MAPREDUCE-4817

Hardcoded task ping timeout kills tasks localizing large amounts of data

    Details

      Description

      When a task is launched and spends more than 5 minutes localizing files, the AM kills the task due to a ping timeout. The AM's TaskHeartbeatHandler currently tracks tasks via a progress timeout and a ping timeout. The progress timeout can be controlled via mapreduce.task.timeout and even disabled by setting the property to 0. The ping timeout, however, is hardcoded to 5 minutes and cannot be configured. Therefore, if a task takes too long to localize, it never starts running, so it cannot ping the AM, and the AM kills it due to the ping timeout.
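
The interaction of the two timeouts described above can be sketched as follows. This is a simplified, hypothetical model rather than the actual TaskHeartbeatHandler code: the class, method, and enum names are invented for illustration, and only the 5-minute ping timeout and the "0 disables mapreduce.task.timeout" rule come from the description.

```java
// Hypothetical model of the two AM-side timeouts; NOT the real TaskHeartbeatHandler.
final class HeartbeatTimeoutSketch {
    // The hardcoded ping timeout from the description: 5 minutes.
    static final long PING_TIMEOUT_MS = 5 * 60 * 1000L;

    enum Verdict { ALIVE, KILLED_PING_TIMEOUT, KILLED_PROGRESS_TIMEOUT }

    /**
     * @param nowMs          current time
     * @param lastPingMs     last ping from the task (launch time if it never pinged)
     * @param lastProgressMs last progress report (launch time if none yet)
     * @param taskTimeoutMs  value of mapreduce.task.timeout; 0 disables the progress check
     */
    static Verdict check(long nowMs, long lastPingMs, long lastProgressMs, long taskTimeoutMs) {
        // The ping check has no off switch, so it applies even while the task localizes.
        if (nowMs - lastPingMs > PING_TIMEOUT_MS) {
            return Verdict.KILLED_PING_TIMEOUT;
        }
        // The progress check is skipped entirely when the timeout is 0.
        if (taskTimeoutMs > 0 && nowMs - lastProgressMs > taskTimeoutMs) {
            return Verdict.KILLED_PROGRESS_TIMEOUT;
        }
        return Verdict.ALIVE;
    }

    public static void main(String[] args) {
        long launchedAt = 0L;
        long sixMinutes = 6 * 60 * 1000L;
        // A task still localizing after 6 minutes, with the progress timeout disabled.
        System.out.println(check(sixMinutes, launchedAt, launchedAt, 0L));
    }
}
```

In this model, setting mapreduce.task.timeout to 0 does not help a task that is still localizing, because the ping check fires first and cannot be turned off.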

      1. MAPREDUCE-4817.patch
        3 kB
        Thomas Graves
      2. MAPREDUCE-4817.patch
        7 kB
        Thomas Graves

        Activity

        Jason Lowe added a comment -

        One possible workaround is to abuse the mapreduce.task.timeout.check-interval-ms property with a large value, effectively disabling the timeout checking. Not ideal since runaway or zombie tasks will no longer be detected as they were before.
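
To see why a large check interval effectively disables timeout detection, here is a rough sketch. It assumes the checker wakes at times 0, I, 2I, ... for a check interval I and flags a task only when it wakes after the task's expiry time. That schedule, the class name, and the numeric values are assumptions made for illustration, not taken from the Hadoop source.

```java
// Hypothetical model of a periodic timeout checker; not Hadoop code.
final class CheckIntervalSketch {
    /**
     * Returns the first wake-up time strictly after expiresAtMs, assuming the
     * checker wakes at multiples of checkIntervalMs (an assumed schedule).
     * Both arguments are expected to be non-negative, with checkIntervalMs > 0.
     */
    static long detectionTimeMs(long expiresAtMs, long checkIntervalMs) {
        return (expiresAtMs / checkIntervalMs + 1) * checkIntervalMs;
    }

    public static void main(String[] args) {
        long expiresAt = 5 * 60 * 1000L;          // task expires 5 minutes after launch
        long shortInterval = 30 * 1000L;          // illustrative 30-second interval
        long hugeInterval = 7L * 24 * 60 * 60 * 1000L; // illustrative 7-day interval

        System.out.println(detectionTimeMs(expiresAt, shortInterval));
        System.out.println(detectionTimeMs(expiresAt, hugeInterval));
    }
}
```

With the huge interval the expiry is not noticed until the next wake-up, days later, which is also why runaway or zombie tasks go undetected under this workaround.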

        Thomas Graves added a comment -

        This JIRA will make the ping timeout configurable. MAPREDUCE-4818 will be for the actual fix to the issue.

        Vinod Kumar Vavilapalli added a comment -

        I was going to +1 the proposal, but wasn't comfortable, so I dug into some history. Please see my comments at MAPREDUCE-4089. As I proposed there, should we knock off the ping thread altogether?

        But the problem persists in that, for large enough local resources, the taskTimeOut may eventually fire instead, which we'll need to address. So, shall we just knock off the ping thread here and let MAPREDUCE-4818 put in the real fix?

        Thomas Graves added a comment -

        When you say knock off the ping thread I assume you really mean just the ping timeout check since the task progress happens in the same thread?

        So the ping serves multiple purposes. First, it notifies the AM that the task has "pinged" in and is still running. This could be useful even with taskTimeout, since the taskTimeout could be turned off (set to 0) and we would never know if that task got hung. Second, the task uses it to check whether the AM is still alive. If the ping doesn't return true, the task is supposed to exit. 1.X also had the ping check, but it went to the TaskTracker, and the TaskTracker validated that the parent Task of the ping checker thread was still there.

        Now with 0.23 the NodeManager watches the processes and reports back to the RM when the AM dies, and if the AM died, the other tasks get killed. But if the entire NodeManager goes down, the task doesn't know the AM went away. If the task isn't sending progress, the task timeout is set to 0, and this is the last AM retry, it could hang around forever.

        The odds of that seem pretty small, and I guess if we aren't worried about the first happening, the second probably isn't that interesting either. But we could also just remove the ping timeout check in the TaskHeartbeatHandler. What exactly are you proposing?
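
The task-side use of the ping mentioned above (exit when the ping does not return true) can be sketched like this. The Umbilical interface, the Action enum, and the attempt ID string are hypothetical stand-ins chosen for illustration; they are not the real task-to-AM protocol classes.

```java
// Hypothetical sketch of the task-side ping decision; not the real Hadoop task code.
final class TaskPingSketch {
    /** Assumed stand-in for the task-to-AM call; true means the AM still knows this task. */
    interface Umbilical {
        boolean ping(String taskAttemptId);
    }

    enum Action { KEEP_RUNNING, EXIT }

    /** Decide what the task should do after one ping round trip. */
    static Action onPing(Umbilical umbilical, String taskAttemptId) {
        // Per the comment above: if the ping does not return true, the task should exit.
        return umbilical.ping(taskAttemptId) ? Action.KEEP_RUNNING : Action.EXIT;
    }

    public static void main(String[] args) {
        Umbilical healthyAm = id -> true;   // AM answers and recognizes the task
        Umbilical unknownTask = id -> false; // AM no longer recognizes the task

        System.out.println(onPing(healthyAm, "attempt_example_0"));
        System.out.println(onPing(unknownTask, "attempt_example_0"));
    }
}
```

This covers only the task's half of the ping. The AM-side ping timeout check is a separate piece, which is why one can be removed without touching the other.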

        Thomas Graves added a comment -

        Here is the patch that adds the config for the ping timeout. I am attaching it because it was already finished before the other comments came in, and in case we want to go that way.

        Thomas Graves added a comment -

        This patch removes the ping timeout check from the AM's TaskHeartbeatHandler. If we want to remove the other side from each Task, we can do that in a separate JIRA.

        Robert Joseph Evans added a comment -

        The patch is simple and straightforward. I am +1, assuming that Jenkins is OK with it. I am not sure that we need to update the task. The ping is used to check whether the task can still reach the AM. If you want to remove it, go ahead and file a JIRA, but it may have further ramifications.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12555186/MAPREDUCE-4817.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3073//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3073//console

        This message is automatically generated.

        Thomas Graves added a comment -

        Thanks Bobby, I've committed this.

        Hudson added a comment -

        Integrated in Hadoop-trunk-Commit #3070 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3070/)
        MAPREDUCE-4817. Hardcoded task ping timeout kills tasks localizing large amounts of data (tgraves) (Revision 1414873)

        Result = FAILURE
        tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1414873
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/TaskAttemptListenerImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/TaskHeartbeatHandler.java
        Hudson added a comment -

        Integrated in Hadoop-Yarn-trunk #51 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/51/)
        MAPREDUCE-4817. Hardcoded task ping timeout kills tasks localizing large amounts of data (tgraves) (Revision 1414873)

        Result = SUCCESS
        tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1414873
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/TaskAttemptListenerImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/TaskHeartbeatHandler.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #450 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/450/)
        merge -r 1414872:1414873 from trunk. FIXES: MAPREDUCE-4817 (Revision 1414875)

        Result = SUCCESS
        tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1414875
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/TaskAttemptListenerImpl.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/TaskHeartbeatHandler.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #1241 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1241/)
        MAPREDUCE-4817. Hardcoded task ping timeout kills tasks localizing large amounts of data (tgraves) (Revision 1414873)

        Result = FAILURE
        tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1414873
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/TaskAttemptListenerImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/TaskHeartbeatHandler.java

          People

          • Assignee: Thomas Graves
          • Reporter: Jason Lowe
          • Votes: 0
          • Watchers: 6