Hadoop Common / HADOOP-5280

When expiring a lost launched task, JT doesn't remove the attempt from the taskidToTIPMap.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.2, 0.20.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    1. 5280.patch
      1.0 kB
      Devaraj Das

      Activity

      Vinod Kumar Vavilapalli created issue -
      Vinod Kumar Vavilapalli added a comment -

      On one of the clusters, a map attempt was expired as a lost task in the ExpireLaunchingTasks thread, but it was not removed from taskidToTIPMap. All the reducers were informed that the map had failed. In the next heartbeat the TT came back reporting the attempt as a success, thereby preventing the launch of any new map attempts for this task.
      Subsequently, all the reduces stalled waiting for the output from this map task, and the whole job got stuck with no progress.
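
      The sequence reported here can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual JobTracker code: the class `LostTaskSketch`, its methods, and the `removeOnExpiry` switch are all hypothetical, and the map `attemptToTask` merely stands in for taskidToTIPMap. It shows why an attempt left in the map after expiry can still have a late success report accepted, so that no new attempt is scheduled.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the bookkeeping described in the report.
// All names here are hypothetical; this is not the real JobTracker code.
class LostTaskSketch {
    // attempt id -> owning task id (stands in for taskidToTIPMap)
    private final Map<String, String> attemptToTask = new HashMap<String, String>();
    // task id -> attempt recorded as successful
    private final Map<String, String> successfulAttempt = new HashMap<String, String>();
    // Assumption for illustration: toggles whether expiry forgets the attempt.
    private final boolean removeOnExpiry;

    LostTaskSketch(boolean removeOnExpiry) {
        this.removeOnExpiry = removeOnExpiry;
    }

    void launch(String attemptId, String taskId) {
        attemptToTask.put(attemptId, taskId);
    }

    // Called when a launched attempt did not report in time.
    // Notifying reducers of the failure is not modelled here.
    void expire(String attemptId) {
        if (removeOnExpiry) {
            attemptToTask.remove(attemptId);
        }
    }

    // Called when a heartbeat reports the attempt as successful.
    // Returns true if the report was accepted.
    boolean reportSuccess(String attemptId) {
        String taskId = attemptToTask.get(attemptId);
        if (taskId == null) {
            return false; // unknown attempt: the report is ignored
        }
        successfulAttempt.put(taskId, attemptId);
        return true;
    }

    // A task needs a new attempt only while it has no successful one on record.
    boolean needsNewAttempt(String taskId) {
        return !successfulAttempt.containsKey(taskId);
    }
}
```

      With `removeOnExpiry` set to false, the late success report is accepted and `needsNewAttempt` returns false for the task, which mirrors the stall described here. With it set to true, the report is ignored and the task remains eligible for a new attempt.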

      Vinod Kumar Vavilapalli made changes -
      Field Original Value New Value
      Priority Major [ 3 ] Blocker [ 1 ]
      Affects Version/s 0.20.0 [ 12313438 ]
      Component/s mapred [ 12310690 ]
      Hemanth Yamijala made changes -
      Fix Version/s 0.20.0 [ 12313438 ]
      Devaraj Das added a comment -

      Attaching patch. Vinod, could you please test things out with this patch? Thanks!

      Devaraj Das made changes -
      Attachment 5280.patch [ 12400424 ]
      Devaraj Das made changes -
      Assignee Devaraj Das [ devaraj ]
      Vinod Kumar Vavilapalli added a comment -

      This bug was originally revealed under the circumstances described in HADOOP-5285. With the above patch applied, and without the patch for HADOOP-5285, the symptom of reducers stuck waiting for output from already-failed tasks no longer seems to appear.

      The uploaded patch prevents tasks from wrongly going from the FAILED state to any of UNASSIGNED, RUNNING, COMMIT_PENDING or SUCCEEDED, and looks fine.

      `ant test` and `ant test-patch` passed successfully on my local machine. +1 overall.
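
      The kind of guard described in this comment can be sketched as follows. This is a minimal illustration, not the contents of 5280.patch: the class `TransitionGuardSketch` and its `update` method are hypothetical, and only the state names are taken from the comment.

```java
// Minimal sketch of a state-transition guard.
// Hypothetical class; only the state names come from the comment above.
class TransitionGuardSketch {
    enum State { UNASSIGNED, RUNNING, COMMIT_PENDING, SUCCEEDED, FAILED }

    private State state = State.UNASSIGNED;

    State getState() {
        return state;
    }

    // Applies a reported status. Returns false if the update is rejected.
    boolean update(State reported) {
        // Once an attempt is FAILED, a later report must not revive it.
        if (state == State.FAILED && reported != State.FAILED) {
            return false;
        }
        state = reported;
        return true;
    }
}
```

      In this sketch, an attempt that was marked FAILED and later reported as SUCCEEDED by a heartbeat stays FAILED, so the task would remain eligible for a fresh attempt.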

      Devaraj Das added a comment -

      I just committed this to the 0.20 and 0.21 branches. We should commit this to the 0.19 branch after the release of 0.19.1.

      Devaraj Das made changes -
      Hadoop Flags [Reviewed]
      Fix Version/s 0.21.0 [ 12313563 ]
      Resolution Fixed [ 1 ]
      Status Open [ 1 ] Resolved [ 5 ]
      Devaraj Das added a comment -

      I committed this to the 0.19 branch.

      Devaraj Das made changes -
      Affects Version/s 0.20.0 [ 12313438 ]
      Fix Version/s 0.19.2 [ 12313650 ]
      Hudson added a comment -

      Integrated in Hadoop-trunk #766 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/766/)
      Nigel Daley made changes -
      Fix Version/s 0.21.0 [ 12313563 ]
      Nigel Daley made changes -
      Status Resolved [ 5 ] Closed [ 6 ]
      Owen O'Malley made changes -
      Component/s mapred [ 12310690 ]

        People

        • Assignee:
          Devaraj Das
          Reporter:
          Vinod Kumar Vavilapalli
        • Votes:
          0
          Watchers:
          1
