Hadoop Map/Reduce
MAPREDUCE-3473

A single task tracker failure shouldn't result in Job failure

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.20.205.0, 0.23.0
    • Fix Version/s: None
    • Component/s: tasktracker
    • Labels: None

      Description

      Currently, some task failures can result in job failures. For example, a local TT disk failure seen in TaskLauncher#run, TaskRunner#run, or MapTask#run is visible to the JobClient and can hang it, causing the job to fail. Job execution should always be able to survive a task failure if there are sufficient resources.

        Issue Links

          Activity

          Eli Collins created issue -
          Eli Collins made changes -
          Link: This issue relates to MAPREDUCE-3121
          Ravi Gummadi added a comment -

          It would be good to distinguish between failures based on failure type. The AM can decide whether or not to re-execute/relaunch a task (and how many times) based on the type of failure and the number of times that task has failed. No?
          Eli Collins added a comment -

          That makes sense. Either way, a single task failure should never cause a job to fail, right?
          Ravi Gummadi added a comment -

          Currently in trunk (and even in 0.20), a task gets re-launched multiple times, until it has failed on (say) 4 different nodes. Right? The possibility of a task failing on 4 different nodes because of these special types of task failures (like disk failures) is very, very low. Right?
          Eli Collins added a comment -

          The issue is that the single failure can be visible to the client, e.g. when the JobClient tries to get the task log from a particular TT that has failed. The reliability mechanism should be invisible to the client. See also MAPREDUCE-2960 (A single TT disk failure can cause the job to fail). To see what I'm talking about, run a job in a loop and then fail one of the disks on one of the TTs; some percentage of the time this single failure on a single TT can cause the entire job to fail.
          Sharad Agarwal added a comment -

          The knobs to control job failure due to task failures are mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent. The default value is 0. Do you mean these don't work?
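          A minimal sketch of how a job could opt in to this tolerance, assuming the old org.apache.hadoop.mapred API and its JobConf setters for these properties (exact property names vary between 0.20 and later releases):

            import org.apache.hadoop.mapred.JobConf;

            public class FailureTolerantJobSetup {
              public static JobConf configure(JobConf conf) {
                // Corresponds to mapreduce.map.failures.maxpercent / mapreduce.reduce.failures.maxpercent
                // (mapred.max.map.failures.percent / mapred.max.reduce.failures.percent in 0.20).
                conf.setMaxMapTaskFailuresPercent(5);     // tolerate up to 5% failed map tasks
                conf.setMaxReduceTaskFailuresPercent(0);  // reduce failures still fail the job (the default)
                return conf;
              }
            }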
          Eli Collins added a comment -

          Ah, I didn't realize these defaulted to zero, thanks for pointing this out. Anyone know the rationale behind having jobs not tolerate a single task failure by default? From reading HADOOP-1144 it seems like this was chosen because it was the initial behavior before the code could handle task failure.
          Sharad Agarwal added a comment -

          Note this is task failure, NOT task attempt failure. A task failure would mean losing the processing of the corresponding input split. Not all applications would be OK with that.
          Explicitly setting a non-zero value makes sense, so that losing data doesn't come as a surprise to applications.
          Eli Collins added a comment -

          Cool, so we should re-purpose this jira for defaulting *failures.maxpercent to non-zero values? What value do people use in practice?
          Mahadev konar added a comment -

          I think we do need the defaults to be 0. Changing it to anything else will be a regression. As Sharad said, losing data should not come as a surprise to users.
          Eli Collins added a comment -

          Why would a non-zero value result in a job completing successfully but with data loss?
          Subroto Sanyal added a comment -

          mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent hold the percentage of failed tasks that a job will tolerate.

          Say a map fails and the failure falls within the tolerance limit; then the output of that mapper is lost (it will not be considered for further computation). The same applies to reducers.

          I suggest letting the user decide this failure percentage and be prepared for such data loss; otherwise, a non-zero value would come as a surprise to the user.

          Further, I feel there isn't any correct default non-zero value for these configurations. These values depend on user scenarios/use cases.
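          To make the percentage arithmetic concrete, a small illustration with made-up numbers (an assumption for illustration only; the exact threshold check and rounding in the scheduler may differ by version):

            public class FailureToleranceMath {
              public static void main(String[] args) {
                int numMapTasks = 200;  // hypothetical job size
                int maxPercent = 5;     // mapreduce.map.failures.maxpercent
                int toleratedMapFailures = numMapTasks * maxPercent / 100;  // = 10
                System.out.println("Tolerated failed map tasks: " + toleratedMapFailures);
                // Up to 10 map tasks may exhaust all their attempts and the job can still
                // succeed; those 10 input splits contribute no output, which is the
                // "data loss" discussed above.
              }
            }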
          Subroto Sanyal added a comment -

          @Eli
          Is the issue still valid?
          Any other suggestions/opinions?
          Eli Collins added a comment -

          @Subroto. Thanks for pinging. Users expect their jobs to complete even if a TT running one of their jobs fails, right? So that we're clear on terminology, what I identified is that a single TT failure can cause the job to fail. A job should be able to survive a single machine failure, right?
          Eli Collins made changes -
          Summary: "Task failures shouldn't result in Job failures" changed to "A single task tracker failure shouldn't result in Job failure"
          Sharad Agarwal added a comment -

          A single machine failure doesn't result in the job failing; that's the whole point of Hadoop. (smile)

          We are missing the difference between Task and TaskAttempt. A task gets 4 chances (task attempts) by default to run before the job is declared failed.

          I think this issue can be resolved as Invalid.
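          For reference, the per-task retry count mentioned above is itself configurable. A hedged sketch using the old org.apache.hadoop.mapred API (the values shown are the usual defaults, assuming they haven't been overridden in the cluster configuration):

            import org.apache.hadoop.mapred.JobConf;

            public class TaskAttemptLimits {
              public static JobConf configure(JobConf conf) {
                // How many TaskAttempts a single Task gets before the Task itself is
                // declared failed (and then counted against the failures.maxpercent knobs).
                conf.setMaxMapAttempts(4);      // mapred.map.max.attempts / mapreduce.map.maxattempts
                conf.setMaxReduceAttempts(4);   // mapred.reduce.max.attempts / mapreduce.reduce.maxattempts
                return conf;
              }
            }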
          Eli Collins added a comment -

          "a single machine failure doesn't result in job to fail; thats the whole point of hadoop."

          It can; that's the point of this bug!
          Eli Collins added a comment -

          MAPREDUCE-2960 contains details for a specific example. I think what's going on is that the ability to tolerate disk failures now means you can get a set of task attempt failures on a single TT where previously there would have been just one (because the TT used to stop itself when it saw a disk failure).
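          A related knob worth noting here, as an aside rather than a proposed fix: the old org.apache.hadoop.mapred API exposes a per-job limit on task failures per tracker, after which that TT is blacklisted for the job. A hedged sketch (the default shown is the 0.20-era value; later releases may differ):

            import org.apache.hadoop.mapred.JobConf;

            public class PerTrackerFailureLimit {
              public static JobConf configure(JobConf conf) {
                // After this many of the job's task attempts fail on one TaskTracker, that TT
                // is blacklisted for the job, so one bad node shouldn't consume all of a
                // task's attempts.
                conf.setMaxTaskFailuresPerTracker(4);  // mapred.max.tracker.failures
                return conf;
              }
            }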

            People

            • Assignee: Unassigned
            • Reporter: Eli Collins
            • Votes: 0
            • Watchers: 8
