Hadoop Common / HADOOP-5280

When expiring a lost launched task, JT doesn't remove the attempt from the taskidToTIPMap.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.2, 0.20.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    1. 5280.patch
      1.0 kB
      Devaraj Das

      Activity

      Hudson added a comment -

      Integrated in Hadoop-trunk #766 (see http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/766/)

      Devaraj Das added a comment -

      I committed this to the 0.19 branch.

      Devaraj Das added a comment -

      I just committed this to the 0.20 and 0.21 branches. We should commit this to the 0.19 branch after the release of 0.19.1.

      Vinod Kumar Vavilapalli added a comment -

      This bug was originally revealed by HADOOP-5285. With the above patch, and without the patch for HADOOP-5285, the symptom of stuck reducers waiting for output from already failed tasks doesn't seem to be visible any more.

      The uploaded patch prevents tasks from wrongly going from the FAILED state to any of UNASSIGNED, RUNNING, COMMIT_PENDING or SUCCEEDED, and looks fine.

      `ant test` and `ant test-patch` passed successfully on my local machine. +1 overall.
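
The transition rule described in this comment can be pictured with a small guard. This is only an illustrative sketch, not the contents of 5280.patch: the class `FailedStateGuardSketch`, the method `nextState`, and the local `State` enum are hypothetical stand-ins for the task states named above.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of a state-transition guard; not the actual patch.
class FailedStateGuardSketch {
    // Local stand-in for the task states named in the comment.
    enum State { UNASSIGNED, RUNNING, COMMIT_PENDING, SUCCEEDED, FAILED }

    // States that a FAILED attempt must never move to.
    static final Set<State> BLOCKED_AFTER_FAILED = EnumSet.of(
        State.UNASSIGNED, State.RUNNING, State.COMMIT_PENDING, State.SUCCEEDED);

    // Returns the state to record after a status report arrives.
    // A report that would move a FAILED attempt into a blocked state is ignored.
    static State nextState(State current, State reported) {
        if (current == State.FAILED && BLOCKED_AFTER_FAILED.contains(reported)) {
            return current;
        }
        return reported;
    }

    public static void main(String[] args) {
        // A late success report for an attempt already marked FAILED is dropped.
        System.out.println(nextState(State.FAILED, State.SUCCEEDED));
        // Normal progress from RUNNING to SUCCEEDED is still accepted.
        System.out.println(nextState(State.RUNNING, State.SUCCEEDED));
    }
}
```

With a guard of this shape, a late SUCCEEDED report cannot overwrite a failure that the reducers have already been told about.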

      Devaraj Das added a comment -

      Attaching patch. Vinod, could you please test things out with this patch? Thanks!

      Vinod Kumar Vavilapalli added a comment -

      On one of the clusters, a map attempt was expired as a lost task in the ExpireLaunchingTasks thread, but it was not removed from taskidToTIPMap. All the reducers were informed that the map had failed. In the next heartbeat the TT came back reporting the attempt as a success, thereby preventing the launch of any new map attempts for this task.
      Subsequently, all the reduces stalled waiting for the output from this map task, and the whole job got stuck with no progress.
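
The sequence of events above can be reproduced with a toy model. This is a rough sketch under stated assumptions, not the real JobTracker code: apart from `taskidToTIPMap`, every name here (`LostTaskSketch`, `launch`, `expireLaunchingTask`, `heartbeat`, the `removeFromMap` flag) is hypothetical, and the real map and status types are simplified to strings and a small enum.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the reported scenario; all names except taskidToTIPMap are hypothetical.
class LostTaskSketch {
    enum State { RUNNING, SUCCEEDED, FAILED }

    // Attempt id -> id of the task (TIP) the attempt belongs to.
    final Map<String, String> taskidToTIPMap = new HashMap<>();
    // TIP id -> last state recorded for that task.
    final Map<String, State> tipState = new HashMap<>();

    void launch(String attemptId, String tipId) {
        taskidToTIPMap.put(attemptId, tipId);
        tipState.put(tipId, State.RUNNING);
    }

    // Expire an attempt that never reported back after launch.
    // removeFromMap = false mimics the behaviour described in this report.
    void expireLaunchingTask(String attemptId, boolean removeFromMap) {
        String tipId = taskidToTIPMap.get(attemptId);
        if (tipId == null) {
            return;
        }
        tipState.put(tipId, State.FAILED); // reducers are told the map failed
        if (removeFromMap) {
            taskidToTIPMap.remove(attemptId);
        }
    }

    // Apply a status report from a heartbeat; returns true if it was applied.
    boolean heartbeat(String attemptId, State reported) {
        String tipId = taskidToTIPMap.get(attemptId);
        if (tipId == null) {
            return false; // unknown attempt: the report is ignored
        }
        tipState.put(tipId, reported);
        return true;
    }

    public static void main(String[] args) {
        LostTaskSketch buggy = new LostTaskSketch();
        buggy.launch("attempt_1", "tip_1");
        buggy.expireLaunchingTask("attempt_1", false);
        buggy.heartbeat("attempt_1", State.SUCCEEDED);
        // The stale entry lets the late report flip the task back to SUCCEEDED.
        System.out.println("entry kept:    " + buggy.tipState.get("tip_1"));

        LostTaskSketch cleaned = new LostTaskSketch();
        cleaned.launch("attempt_1", "tip_1");
        cleaned.expireLaunchingTask("attempt_1", true);
        cleaned.heartbeat("attempt_1", State.SUCCEEDED);
        // With the entry removed, the late report is ignored and the task stays FAILED.
        System.out.println("entry removed: " + cleaned.tipState.get("tip_1"));
    }
}
```

In the first run the task ends up marked as succeeded even though the reducers were already told it failed, so no new attempt would be scheduled; that mirrors the stall described here.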


        People

        • Assignee:
          Devaraj Das
          Reporter:
          Vinod Kumar Vavilapalli