Hadoop Map/Reduce: MAPREDUCE-42

task trackers should not restart for having a late heartbeat

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      TaskTrackers should not close and restart themselves for having a late heartbeat. The JobTracker should just accept their current status.
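
      As a minimal sketch of the behavior the description asks for (every class, field, and method name below is illustrative, not the actual Hadoop JobTracker code): when a heartbeat arrives from a tracker the JobTracker has already written off, the JobTracker accepts the reported status and moves the tracker back into the live set instead of replying with a "reinitialize yourself" directive.

        // Illustrative sketch only; not the real org.apache.hadoop.mapred.JobTracker.
        import java.util.HashMap;
        import java.util.HashSet;
        import java.util.Map;
        import java.util.Set;

        class HeartbeatSketch {
          enum Directive { CONTINUE, REINITIALIZE }

          // Stand-in for the status report a TaskTracker sends with each heartbeat.
          static class TrackerStatus {
            final String trackerName;
            long lastSeen;
            TrackerStatus(String trackerName, long lastSeen) {
              this.trackerName = trackerName;
              this.lastSeen = lastSeen;
            }
          }

          private final Map<String, TrackerStatus> liveTrackers = new HashMap<>();
          private final Set<String> lostTrackers = new HashSet<>();

          // Called whenever a heartbeat arrives, possibly after the expiry interval.
          Directive heartbeat(TrackerStatus status, long now) {
            status.lastSeen = now;
            boolean wasMarkedLost = lostTrackers.remove(status.trackerName);
            // Old behavior: if wasMarkedLost, reply REINITIALIZE, forcing the tracker
            // to restart and discard its running tasks. Proposed behavior: accept the
            // late status either way and simply treat the tracker as live again.
            liveTrackers.put(status.trackerName, status);
            return Directive.CONTINUE;
          }
        }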

          Activity

          Devaraj Das added a comment -

          Attached is the patch.

          Doug Cutting added a comment -

          It strikes me that we should instead improve detection of task tracker death, rather than to try to revive things when we mistakenly decide one is dead. Thoughts?

          Devaraj Das added a comment -

          I am open to a discussion on this, but while we are discussing the possible solutions to the TaskTracker heartbeat problem in HADOOP-312 (I am actually testing a patch with an increased queue length plus retries), I think we can consider this patch as a solution to the problem raised in this JIRA issue. It doesn't do any harm, I think.

          Sameer Paranjpye added a comment -

          I feel that improved detection of tasktracker death is a separate issue, which needs addressing. At the same time, we need to try not to lose work if communication between a tasktracker and the jobtracker fails for some reason.

          For instance, a tasktracker may appear lost to the jobtracker due to transient network problems. In such a case it would be ok for the jobtracker to mark the lost tasks as failed and reschedule them elsewhere. If communication to the jobtracker is subsequently restored while the job is still in progress, the jobtracker can easily mark the lost-and-found tasks as succeeded. Multiple instances of a task should be handled by the speculative execution code. It seems like we could avoid losing a lot of work if we had such a mechanism in place.
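
          As a rough sketch of the reconciliation described above (the types and method names here are made up for illustration and do not match the real JobTracker internals): when a tracker that had been written off reports in again, any of its tasks that were marked failed only because the tracker was presumed lost, but that the tracker actually completed, are flipped back to succeeded, and any redundant attempt launched in the meantime is killed.

            // Illustrative sketch only; not actual Hadoop code.
            import java.util.List;
            import java.util.Map;

            class LostAndFoundSketch {
              enum TaskState { RUNNING, SUCCEEDED, FAILED_LOST_TRACKER }

              static class TaskRecord {
                TaskState state = TaskState.RUNNING;
                String redundantAttemptId;   // attempt launched after the tracker was "lost"
              }

              private final Map<String, TaskRecord> tasks;   // taskId -> record

              LostAndFoundSketch(Map<String, TaskRecord> tasks) {
                this.tasks = tasks;
              }

              // Called when a previously "lost" tracker reconnects and reports the
              // tasks it actually finished while it was unreachable.
              void reconcile(List<String> completedTaskIds) {
                for (String taskId : completedTaskIds) {
                  TaskRecord rec = tasks.get(taskId);
                  if (rec != null && rec.state == TaskState.FAILED_LOST_TRACKER) {
                    // The "lost" work is actually done: mark it succeeded and clean up
                    // the redundant attempt, as the speculative-execution path would.
                    rec.state = TaskState.SUCCEEDED;
                    if (rec.redundantAttemptId != null) {
                      killAttempt(rec.redundantAttemptId);
                    }
                  }
                }
              }

              private void killAttempt(String attemptId) {
                // Placeholder: a real implementation would send a kill action to
                // whichever tracker is running the redundant attempt.
              }
            }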

          eric baldeschwieler added a comment -

          Would this have a substantial impact on the large sort benchmark tests our team runs?

          Doug Cutting added a comment -

          > improved detection of tasktracker death is a separate issue

          It's certainly related. This issue deals with fixing things when tasktracker deaths are mis-detected. If we didn't misdetect so much, this would not be an issue.

          > Multiple instances of a task should be handled by the speculative execution code.

          What if speculative execution is disabled in the job? Then we'd get multiple instances of the task running at once, when the client explicitly requested that not happen.

          I'm +0 on this patch. If others feel strongly that it's the best approach, I won't veto it. But I would prefer we address the root problem first, and then see if this is still an issue, before adding this new mechanism. Does that make sense, or am I missing something?

          Yoram Arnon added a comment -

          Eric, we don't get many lost task trackers in our sort benchmark any more, so I don't expect a significant improvement.

          In other jobs, though, we do get the occasional lost task tracker. While every such case where nothing 'real' happens (HW error, network, etc.) is a bug, hunting every last one of them down is going to take quite a while.
          Minimizing the damage caused by such a glitch would be welcome if it doesn't cause adverse side effects.

          Devaraj Das added a comment -

          Doug, does it make sense to do what is done in this patch only when speculative execution is on?

          eric baldeschwieler added a comment -

          Oh a whole thread can be had on this I'm sure!

          Why does one turn off speculative execution? Presumably because a MAP has unmanaged side-effects?

          But... the framework will still rerun jobs if they complete and then the node is lost, right? Won't this tickle exactly the same issues that speculative execution raised anyway?

          Doesn't this imply that disallowing speculative execution is basically not the right mechanism to deal with the side-effect issue and that this deserves a rethink?

          eric baldeschwieler added a comment -

          On reintegrating lost task trackers...

          It does seem to me like we should do this, but we need to make sure we reason through how this affects corner cases, what invariants the system does maintain, and so on.

          I suggest we work this through, and then go forward with this patch (modified if we find any corner cases) and post the reasoning, so we can review it as this logic evolves. (And update any existing documentation in this area of course...)

          Doug Cutting added a comment -

          > Why does one turn off speculative execution?

          In the case of the Nutch crawler, speculative execution is disabled to observe politeness. We do not want two tasks to attempt to fetch pages from a site at the same time.

          This patch adds a fair amount of complexity, introducing a new state for tasks (presumed dead, but reanimatable). A new state is likely to add new failure modes.

          Does anyone deny that this primarily addresses an issue that would go away if we could more reliably detect tasktracker death? Shouldn't we attempt to fix that first? Sameer raises the issue of "transient network problems". Are we actually seeing these? Even if these were to occur, the system would operate correctly as-is: this is an optimization. Is this a common-enough case that we can afford to optimize it?

          Owen O'Malley added a comment -

          I agree that the original desire for this patch was born of the TaskTracker timeouts that shouldn't happen. Fixing those problems (and we have fixed most of them over the last 4 months) should take precedence. That said, I think in the long term we do want something like this patch. If a switch goes down for 15 minutes and then comes back up, it does not make sense to reshuffle, resort, and rerun a reduce that takes hours to run.

          All map/reduce applications, even those with speculative execution turned off, must permit redundant copies of their tasks for precisely this reason. In this case, the JobTracker has decided a given task is dead but hasn't been able to tell the responsible TaskTracker yet, so it schedules another instance of the failed task on a different node. As a result, the two instances are going to run in parallel for a while.

          I guess for now, let's sit on this patch and contemplate what the model should be for dealing with communication problems. We should also monitor this in real use, see how often task trackers are being lost, and put some effort into determining whether it is the job tracker or the task tracker that is the cause of the delay.
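
          One common way to make such redundant attempts harmless, sketched below under assumed names (this is the general per-attempt temporary-output pattern, not Hadoop's actual output committer), is to have every attempt write into its own scratch directory and promote exactly one attempt's output when the framework commits the task.

            // Illustrative sketch of side-effect isolation for duplicate task attempts.
            import java.io.IOException;
            import java.nio.file.Files;
            import java.nio.file.Path;

            class AttemptOutputSketch {
              private final Path jobOutputDir;

              AttemptOutputSketch(Path jobOutputDir) {
                this.jobOutputDir = jobOutputDir;
              }

              // Each attempt writes under its own temporary directory, so two attempts
              // of the same task never clobber each other's output.
              Path workingDirFor(String taskId, int attempt) throws IOException {
                Path dir = jobOutputDir.resolve("_temporary").resolve(taskId + "_" + attempt);
                return Files.createDirectories(dir);
              }

              // Exactly one attempt is committed: its temporary output is renamed into
              // place; every other attempt's scratch directory is simply discarded later.
              void commit(String taskId, int winningAttempt) throws IOException {
                Path src = jobOutputDir.resolve("_temporary").resolve(taskId + "_" + winningAttempt);
                Path dst = jobOutputDir.resolve(taskId);
                Files.move(src, dst);
              }
            }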

          Doug Cutting added a comment -

          > If a switch goes down for 15 minutes [ ... ]

          We'll currently have a lot of other problems if a switch goes down for 15 minutes. All of the other tasks will probably fail because DFS will no longer have complete copies of files.

          Is a switch going down for 15 minutes really a case we need to optimize? Is it acceptable to lose a few hours work on its hosts when a switch dies? When a switch fails, how long does it take to replace?

          We can answer some of this fairly precisely. What is the MTBF for switches? How many switches would we have in a 10k-node system?
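
          (A purely illustrative back-of-envelope answer, with assumed figures that are not from this thread: at roughly 40 nodes per rack switch, a 10,000-node system would need on the order of 250 switches; if each switch had an MTBF of, say, 200,000 hours, the fleet would average about 250 × 8,760 / 200,000 ≈ 11 switch failures per year, i.e. roughly one a month.)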

          Doug Cutting added a comment -

          I doubt this patch applies cleanly anymore, and there was doubt about whether this was the best approach...

          Harsh J added a comment -

          This was decided to not be a good approach. The patch has also gone stale.


            People

            • Assignee: Devaraj Das
            • Reporter: Owen O'Malley
            • Votes: 0
            • Watchers: 0
