Hadoop Map/Reduce: MAPREDUCE-2217

The expire launching task should cover the UNASSIGNED task

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0, 1.1.1
    • Fix Version/s: 1.2.0
    • Component/s: jobtracker
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The ExpireLaunchingTasks thread kills tasks that have been scheduled but have not responded.
      Currently, if a task is scheduled on a TaskTracker and for some reason the TaskTracker cannot move it to RUNNING,
      the task just hangs in the UNASSIGNED state and the JobTracker keeps waiting for it.

      JobTracker.ExpireLaunchingTasks should be able to kill such a task.

      1. MR-2217.patch
        0.8 kB
        Karthik Kambatla
      2. MR-2217.patch
        0.8 kB
        Karthik Kambatla
      3. MAPREDUCE-2217.1.txt
        0.8 kB
        Scott Chen
      4. expose-bug-mr-2217.patch
        1 kB
        Karthik Kambatla

        Activity

        Scott Chen created issue -
        Scott Chen made changes -
        Field Original Value New Value
        Attachment MAPREDUCE-2217.txt [ 12466267 ]
        Lianhui Wang added a comment -

        ExpireLaunchingTasks.run() could call JobTracker.killTask(); that would implement it.

        Scott Chen added a comment -

        Hi Lianhui,
        Yes, that's the idea. I have attached a patch for this. Currently we call ExpireLaunchingTasks.removeTask() for any task that sends a status update.
        But we should avoid removing tasks that are still UNASSIGNED, so that they can be killed by ExpireLaunchingTasks.run() as you described.
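
A minimal sketch of the idea in this comment, under stated assumptions: it is a simplified stand-alone model, not the attached patch or the JobTracker source. The class `LaunchingTaskExpirySketch`, its `State` enum, and the methods `addNewTask`, `onStatusUpdate`, and `expire` are hypothetical names chosen for illustration. The point it shows is that a status update only removes a task from the launching set when the reported state is something other than UNASSIGNED, so a task stuck in UNASSIGNED is still eligible for expiry.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the launching-task expiry logic.
// The names echo the discussion, but this is not the JobTracker code.
class LaunchingTaskExpirySketch {
    enum State { UNASSIGNED, RUNNING }

    private final long expiryIntervalMs;
    // attempt id -> time (ms) at which the task was handed to a TaskTracker
    private final Map<String, Long> launchingTasks = new LinkedHashMap<>();

    LaunchingTaskExpirySketch(long expiryIntervalMs) {
        this.expiryIntervalMs = expiryIntervalMs;
    }

    // Record a task as "launching" when it is scheduled.
    void addNewTask(String attemptId, long nowMs) {
        launchingTasks.put(attemptId, nowMs);
    }

    // Called for every status update. A task that still reports UNASSIGNED
    // stays in the launching set so that expire() can time it out.
    void onStatusUpdate(String attemptId, State state) {
        if (state != State.UNASSIGNED) {
            launchingTasks.remove(attemptId);
        }
    }

    // Return (and forget) every task that has been launching for longer
    // than the expiry interval. A real implementation would kill these.
    List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = launchingTasks.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() > expiryIntervalMs) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        LaunchingTaskExpirySketch sketch = new LaunchingTaskExpirySketch(600_000L);
        sketch.addNewTask("attempt_a", 0L);
        sketch.addNewTask("attempt_b", 0L);
        sketch.onStatusUpdate("attempt_a", State.UNASSIGNED); // stays tracked
        sketch.onStatusUpdate("attempt_b", State.RUNNING);    // no longer tracked
        System.out.println(sketch.expire(600_001L));
    }
}
```

In this model, removing every task on any status update (the behavior described as the current one) would drop `attempt_a` from the map as well, and it could then never be expired.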

        Scott Chen added a comment -

        I will try to add a unit test for this.

        Scott Chen made changes -
        Attachment MAPREDUCE-2217.1.txt [ 12469621 ]
        Scott Chen made changes -
        Attachment MAPREDUCE-2217.txt [ 12466267 ]
        Scott Chen made changes -
        Attachment MAPREDUCE-2217.1.txt [ 12469621 ]
        Scott Chen made changes -
        Attachment MAPREDUCE-2217.1.txt [ 12469622 ]
        Arun C Murthy made changes -
        Fix Version/s 0.24.0 [ 12317654 ]
        Fix Version/s 0.23.0 [ 12315570 ]
        Karthik Kambatla (Inactive) added a comment -

        Uploading the same patch generated off of branch-1.

        We noticed this issue when one of the TaskTrackers was faulty: the unassigned tasks on that TaskTracker were never expired, so the job could not complete.

        +1 for the patch.

        Karthik Kambatla (Inactive) made changes -
        Attachment MR-2217.patch [ 12553858 ]
        Karthik Kambatla (Inactive) made changes -
        Assignee Scott Chen [ schen ] Karthik Kambatla [ kkambatl ]
        Scott Chen added a comment -

        Karthik: Thank you for working on this.

        Karthik Kambatla (Inactive) added a comment -

        Sorry for the delay, just got around to this.

        Uploading a patch that exposes the bug on clusters where some hosts have a 1 in their hostname. Running a sample pi job on 4 nodes whose hostnames share a common prefix followed by 01-04 results in the job hanging at 75% map progress.

        Karthik Kambatla (Inactive) made changes -
        Attachment expose-bug-mr-2217.patch [ 12562950 ]
        Karthik Kambatla (Inactive) added a comment -

        The patch posted on 16/Nov fixes the issue.

        To verify this, I ran a Hadoop cluster of 4 nodes with both MR-2217.patch and expose-bug-mr-2217.patch applied. The tasks assigned to machine01 time out and are subsequently scheduled on other nodes, and the job completes. Without MR-2217.patch, the job doesn't progress even after an hour. I used the pi job with 8 mappers and 1000 input splits for this.

        Karthik Kambatla (Inactive) added a comment -

        Re-uploading the patch for Jenkins sanity.

        Karthik Kambatla (Inactive) made changes -
        Attachment MR-2217.patch [ 12562996 ]
        Karthik Kambatla (Inactive) made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 1.1.1 [ 12321660 ]
        Fix Version/s 0.24.0 [ 12317654 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12562996/MR-2217.patch
        against trunk revision .

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3187//console

        This message is automatically generated.

        Alejandro Abdelnur added a comment -

        +1. Nice job forcing the problem to verify the fix.

        Alejandro Abdelnur made changes -
        Issue Type Improvement [ 4 ] Bug [ 1 ]
        Alejandro Abdelnur added a comment -

        Thanks Scott & Karthik. Committed to branch-1.

        Alejandro Abdelnur made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags Reviewed [ 10343 ]
        Fix Version/s 1.2.0 [ 12321661 ]
        Resolution Fixed [ 1 ]
        Matt Foley added a comment -

        Closed upon release of Hadoop 1.2.0.

        Matt Foley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Bob Liu added a comment -

        Have you folks taken the Distributed Cache into consideration? During a large cache download, tasks on the tasktrackers are put in the "UNASSIGNED" state as well. Does that mean jobs with a "large" distributed cache won't even get a chance to launch at all?

        Karthik Kambatla (Inactive) added a comment -

        IIRC, the expiration timeout is 10 minutes by default. Agree it could be a problem if the jobs take longer than 10 minutes to localize. Is this a real concern? If so, would bumping up the expiration timeout be an acceptable workaround?

        Bob Liu added a comment -

        Karthik,

        Yes, this is currently a concern for us (many users' tasks have been unable to launch since we upgraded from 1.0.4 to 1.2.1). Which exact timeout param do you suggest we modify?

        Grigory Turunov added a comment -

        Bob,
        Are you sure that your tasks are not launching? In our case, the task status changes to "Error launching task", but the task continues to run on the TT and can compute some results successfully (or, if the attempts share an output name, it partially overwrites the output of the next attempt, and both usually end up in failed status).
        At the same time, the JT thinks that the first attempt no longer exists, and when the TT sends a heartbeat to the JT, the JT writes "Serious problem. While updating status, cannot find taskid" to its log but does nothing else.

        Bob Liu added a comment -

        We ended up having to raise "mapred.tasktracker.expiry.interval" to give the distributed cache enough time to get localized.
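
A rough sketch of the trade-off behind this workaround, under stated assumptions: `java.util.Properties` stands in for Hadoop's configuration object, the class `ExpiryIntervalSketch` and its methods are hypothetical, and the value is assumed to be in milliseconds. The 10-minute default is the figure recalled earlier in this thread. The sketch only illustrates that a task whose localization takes longer than the interval would be expired, and that raising the interval avoids that.

```java
import java.util.Properties;

// Hypothetical helper; Properties is a stand-in for Hadoop's configuration.
class ExpiryIntervalSketch {
    static final String KEY = "mapred.tasktracker.expiry.interval";
    // Assumption: the value is in milliseconds and defaults to 10 minutes.
    static final long DEFAULT_MS = 10L * 60 * 1000;

    // Read the expiry interval, falling back to the assumed default.
    static long expiryIntervalMs(Properties conf) {
        return Long.parseLong(conf.getProperty(KEY, Long.toString(DEFAULT_MS)));
    }

    // A task still UNASSIGNED after the interval would be expired.
    static boolean wouldExpire(Properties conf, long localizationMs) {
        return localizationMs > expiryIntervalMs(conf);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        long fifteenMinutesMs = 15L * 60 * 1000;
        System.out.println(wouldExpire(conf, fifteenMinutesMs));
        // Raise the interval to 30 minutes so a slow localization can finish.
        conf.setProperty(KEY, Long.toString(30L * 60 * 1000));
        System.out.println(wouldExpire(conf, fifteenMinutesMs));
    }
}
```

A longer interval also delays detection of tasks that are genuinely stuck, so the value is a compromise between the two cases.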

        Karthik Kambatla (Inactive) added a comment -

        Grigory Turunov - we recently ran into this issue as well and are working on a fix. Will update this thread once we have it ready.

        Gavin made changes -
        Assignee Karthik Kambatla [ kkambatl ] Karthik Kambatla [ kasha ]
        Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
        Open -> Patch Available | 750d 23h 36m | 1 | Karthik Kambatla (Inactive) | 02/Jan/13 23:56
        Patch Available -> Resolved | 11h 54m | 1 | Alejandro Abdelnur | 03/Jan/13 11:51
        Resolved -> Closed | 131d 17h 24m | 1 | Matt Foley | 15/May/13 06:15

          People

          • Assignee: Karthik Kambatla
          • Reporter: Scott Chen
          • Votes: 0
          • Watchers: 16
