Hadoop Map/Reduce
MAPREDUCE-733

When running ant test TestTrackerBlacklistAcrossJobs, a lost task tracker heartbeat exception occurs.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: tasktracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      When running ant test TestTrackerBlacklistAcrossJobs, a lost task tracker heartbeat exception occurs.

      It seems that when a task tracker is killed, it throws an exception. Instead, it should catch and process the exception and allow the rest of the flow to go through.

      2009-07-08 11:58:26,116 INFO ipc.Server (Server.java:run(973)) - IPC Server handler 7 on 40193, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@13ec758, false, false, true, 6) from 127.0.0.1:40200: error: java.io.IOException: java.lang.RuntimeException: tracker_host1.rack.com:localhost/127.0.0.1:40197 already has slots reserved for null; being asked to un-reserve for job_200907081158_0001
      java.io.IOException: java.lang.RuntimeException: tracker_host1.rack.com:localhost/127.0.0.1:40197 already has slots reserved for null; being asked to un-reserve for job_200907081158_0001
      at org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker.unreserveSlots(TaskTracker.java:162)
      at org.apache.hadoop.mapred.JobInProgress.addTrackerTaskFailure(JobInProgress.java:1580)
      at org.apache.hadoop.mapred.JobInProgress.failedTask(JobInProgress.java:2908)
      at org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1025)
      at org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:3869)
      at org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:3081)
      at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2819)
      at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:960)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:958)
      2009-07-08 11:58:26,162 INFO mapred.TaskTracker (TaskTracker.java:transmitHeartBeat(1196)) - Resending 'status' to 'localhost' with reponseId '6
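
      The RuntimeException originates in TaskTracker.unreserveSlots, reached from JobInProgress.addTrackerTaskFailure while the JobTracker processes the heartbeat; since nothing catches it, the whole heartbeat RPC fails. The following self-contained Java sketch (illustrative only, not the actual Hadoop source; the class and field names are invented for the example) reproduces this failure mode of un-reserving slots on a tracker that holds no reservation for the job:

      // Illustrative sketch only; not the actual Hadoop source.
      public class UnreserveSketch {
          static class Tracker {
              private String reservedForJob; // null means no reservation is held

              void reserve(String jobId) {
                  reservedForJob = jobId;
              }

              void unreserve(String jobId) {
                  if (reservedForJob == null || !reservedForJob.equals(jobId)) {
                      // Mirrors the RuntimeException in the log above: asked to
                      // un-reserve for a job that does not hold the reservation.
                      throw new RuntimeException("already has slots reserved for "
                              + reservedForJob + "; being asked to un-reserve for " + jobId);
                  }
                  reservedForJob = null;
              }
          }

          public static void main(String[] args) {
              Tracker tracker = new Tracker();
              // The heartbeat path un-reserves unconditionally for the failed task's
              // job, even though this tracker never reserved slots for it:
              tracker.unreserve("job_200907081158_0001"); // throws, failing the heartbeat call
          }
      }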

      1. MAPREDUCE-733_0_20090708_yhadoop20.patch
        1 kB
        Arun C Murthy
      2. MAPREDUCE-733_0_20090708.patch
        1 kB
        Arun C Murthy
      3. MAPREDUCE-733-1.patch
        8 kB
        Sreekanth Ramakrishnan
      4. MAPREDUCE-733-2.patch
        9 kB
        Sreekanth Ramakrishnan
      5. MAPREDUCE-733-3.patch
        14 kB
        Sreekanth Ramakrishnan
      6. MAPREDUCE-733-4.patch
        14 kB
        Sreekanth Ramakrishnan
      7. MAPREDUCE-733-5.patch
        14 kB
        Sreekanth Ramakrishnan
      8. MAPREDUCE-733-ydist.patch
        2 kB
        Sreekanth Ramakrishnan

        Activity

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #20 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/20/)

        Hemanth Yamijala added a comment -

        I committed this to trunk. Thanks, Arun and Sreekanth!

        Sreekanth Ramakrishnan added a comment -

        Core and contrib tests passed locally.

        Hemanth Yamijala added a comment -

        This looks good to me. +1.

        Sreekanth Ramakrishnan added a comment -

        Attaching output from ant test-patch:

             [exec] +1 overall.
             [exec]
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec]
             [exec]     +1 tests included.  The patch appears to include 9 new or modified tests.
             [exec]
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec]
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec]
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec]
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
        
        Sreekanth Ramakrishnan added a comment -

        Internal Y! distribution patch

        Sreekanth Ramakrishnan added a comment -

        Latest patch, with the log statements removed.

        Sreekanth Ramakrishnan added a comment -

        Removing log statements that were not needed in the previous patch.

        Sreekanth Ramakrishnan added a comment -

        Fixing a bug in the previous patch.

        • Added a new test case to test the issue.
        Hemanth Yamijala added a comment -

        Sigh. I just realized there was a bug in the original fix as well. trackersReservedForMaps is a map of tasktrackers to FallowSlotInfo. The fix checks for the presence of a tracker by name instead of by the tasktracker object. So, while this will work to fix the original bug in this JIRA, it will introduce a new bug: it will not remove reservations even in a valid case. Unfortunately, this indicates that we need one more test case that tests the positive condition.
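
        As a standalone illustration of this key-type mismatch (not Hadoop code; the value type here is a simplified stand-in for FallowSlotInfo), probing a map keyed by TaskTracker objects with the tracker's name string compiles, because Map.containsKey takes Object, but can never match:

        import java.util.HashMap;
        import java.util.Map;

        // Standalone illustration of checking a reservation map with the wrong key type.
        public class ReservationKeySketch {
            static class TaskTracker {
                final String trackerName;
                TaskTracker(String trackerName) { this.trackerName = trackerName; }
                // No equals()/hashCode() that would let a name string match this key.
            }

            public static void main(String[] args) {
                Map<TaskTracker, Integer> trackersReservedForMaps = new HashMap<>();
                TaskTracker tt = new TaskTracker("tracker_host1.rack.com");
                trackersReservedForMaps.put(tt, 2); // a genuine reservation

                // Guarding with the name never fires, so valid reservations are never removed:
                System.out.println(trackersReservedForMaps.containsKey(tt.trackerName)); // false
                // Probing with the TaskTracker object itself does find the reservation:
                System.out.println(trackersReservedForMaps.containsKey(tt));             // true
            }
        }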

        Sreekanth Ramakrishnan added a comment -

        Added some comments to the test case.

        Sreekanth Ramakrishnan added a comment -

        Attaching a patch that adds a new test case:

        The patch will apply after the patch for MAPREDUCE-734 has been applied.

        Iyappan Srinivasan added a comment -

        I ran TestTrackerBlacklistAcrossJobs; it passed, and the logs do not contain "java.io.IOException: java.lang.RuntimeException". I also brought up a cluster, submitted jobs randomly, killed some task attempts, and did not come across this string in the jobtracker log.

        Hemanth Yamijala added a comment -

        These changes look good. Again, I think a test case would be helpful. For that, we'll do something similar to MAPREDUCE-734, where we are adding a test via the FakeObjectUtilities APIs.

        Hemanth Yamijala added a comment -

        One more thing I've observed while going through this is that reservations are not removed on a TaskTracker that is globally blacklisted either via large task-failure count or via unhealthy status.

        I had filed MAPREDUCE-682 for tracking this. Arun, if you remember, we had discussed this a couple of days back and decided it was not a major problem. For a short while these reservations might remain for the blacklisted nodes and count against the job, but the healthy nodes can pick up and run (in the next wave) the job's tasks that might otherwise have run on the blacklisted trackers.

        Arun C Murthy added a comment -

        Patch for the Yahoo! hadoop-20 branch.

        Arun C Murthy added a comment -

        Do you mean to say that for any job that is currently running and for future jobs, this should be done?

        I meant for jobs which currently have tasks scheduled on the faulty tasktracker...

        Devaraj Das added a comment -

        tracker being globally blacklisted (and declared 'unhealthy' ?) isn't propagated to the job (via JobInProgress.addTrackerTaskFailure).

        Do you mean to say that for any job that is currently running and for future jobs, this should be done? Is there a good use case for this? The reason I am asking is that globally blacklisted trackers are not considered for assigning new tasks at all. We may run into race conditions (especially for future jobs), where a globally blacklisted tracker may not be considered by jobs for assigning new tasks even after it is marked healthy (globally). For example, a job starts an hour before the tracker is supposed to be marked healthy; since the job blacklists it prematurely, even after it is marked healthy the job cannot make use of this tracker. This can probably be handled, but it might complicate the logic of global blacklisting and per-job blacklisting.

        Arun C Murthy added a comment -

        One more thing I've observed while going through this is that reservations are not removed on a TaskTracker that is globally blacklisted either via large task-failure count or via unhealthy status.

        I swear I saw globally blacklisted trackers being declared 'lost' (via JobTracker.lostTaskTracker)! Maybe I'm just getting old... :)

        Anyway, I'm very surprised that the information about a tracker being globally blacklisted (and declared 'unhealthy' ?) isn't propagated to the job (via JobInProgress.addTrackerTaskFailure). This is a serious drawback of the current implementation of these features. This isn't 'critical', but I think we should address it asap via a separate JIRA. Thoughts?

        Arun C Murthy added a comment -

        This patch seems to fix the exception, though I'm not sure why the test-case doesn't fail even with this exception.

        The crux of the patch is:

        -          taskTracker.unreserveSlots(TaskType.MAP, this);
        -          taskTracker.unreserveSlots(TaskType.REDUCE, this);
        +          if (trackersReservedForMaps.containsKey(trackerName)) {
        +            taskTracker.unreserveSlots(TaskType.MAP, this);
        +          }
        +          if (trackersReservedForReduces.containsKey(trackerName)) {
        +            taskTracker.unreserveSlots(TaskType.REDUCE, this);
        +          }
        

        I've also noticed that JobInProgress.addTrackerTaskFailure wasn't synchronized, which breaks our invariant of locking JobTracker and JobInProgress (in that order), in particular for the call originating at JobTracker.lostTaskTracker. Maybe it wasn't necessary before; anyway, it's been fixed now.
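
        A rough sketch of that lock-ordering invariant (the class shapes and method bodies below are invented for illustration, not the actual JobTracker/JobInProgress code): the JobTracker lock is taken first, so a JobInProgress method reached from a JobTracker callback such as lostTaskTracker must itself be synchronized so the JobInProgress lock is always the second one acquired.

        // Rough illustration of the JobTracker-then-JobInProgress lock order;
        // not the actual Hadoop implementation.
        class JobTrackerSketch {
            private final JobInProgressSketch job = new JobInProgressSketch();

            synchronized void lostTaskTracker(String trackerName) {
                // Lock 1: the JobTracker lock is held here...
                job.addTrackerTaskFailure(trackerName);
                // ...and lock 2 (the JobInProgress lock) is taken inside the call,
                // preserving the JobTracker-then-JobInProgress order.
            }
        }

        class JobInProgressSketch {
            synchronized void addTrackerTaskFailure(String trackerName) {
                // Per-job state (e.g. the reservation maps) is mutated only while
                // holding the JobInProgress lock.
            }
        }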

        Vinod Kumar Vavilapalli added a comment -

        Just looked at the code causing this. This happens whenever there is an attempt to unreserve a job's tasks from a TaskTracker even though the reservation is for a job other than this job. This was supposed to have been handled during MAPREDUCE-516 itself, but unfortunately it was missed (https://issues.apache.org/jira/browse/MAPREDUCE-516?focusedCommentId=12721792&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12721792).

        The resultant behavior is that when a task fails, one heartbeat of the TT is missed, but the next heartbeat passes through. This is because the first heartbeat marks the task as FAILED on the JobTracker, so the faulty code isn't invoked for the same TT again in further heartbeats. This leaves inconsistent state on the JT; for example, immediately following this is the code for creating the task completion event, which would never be created for this task. This issue HAS to be fixed immediately because of the side effects.

        One more thing I've observed while going through this is that reservations are not removed on a TaskTracker that is globally blacklisted either via large task-failure count or via unhealthy status.


          People

          • Assignee: Arun C Murthy
          • Reporter: Iyappan Srinivasan
          • Votes: 0
          • Watchers: 4
