Hadoop Map/Reduce
MAPREDUCE-2129

Job may hang if mapreduce.job.committer.setup.cleanup.needed=false and mapreduce.map/reduce.failures.maxpercent>0

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.2, 0.20.3, 0.21.1
    • Fix Version/s: 1.1.0
    • Component/s: jobtracker
    • Tags: setup/cleanup

      Description

      Job may hang in the RUNNING state if mapreduce.job.committer.setup.cleanup.needed=false and mapreduce.map/reduce.failures.maxpercent>0. It happens when some tasks fail but the failure rate hasn't reached failures.maxpercent.

      1. MAPREDUCE-2129.patch
        3 kB
        Thomas Graves
      2. MAPREDUCE-2129.patch
        2 kB
        Ravi Prakash
      3. MAPREDUCE-2129.patch
        1 kB
        Subroto Sanyal
      4. MAPREDUCE-2129.patch
        0.8 kB
        Kang Xiao

        Activity

        Kang Xiao added a comment -

        Here is an example:

        • The job has 100 maps and no reduces
        • mapreduce.job.committer.setup.cleanup.needed=false
        • mapreduce.map/reduce.failures.maxpercent=5, so at most 5 map TIPs are allowed to fail
        • 99 maps succeeded
        • the last map failed 4 attempts, so the last TIP failed
        • the failed TIP does not cause the job to fail, since 1 < 5
        • no cleanup task is launched, since mapreduce.job.committer.setup.cleanup.needed=false
        • jobComplete() at the tail of completedTask() never gets invoked, so the job hangs in the RUNNING state
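        The scenario above can be sketched as a toy simulation (illustrative Python; the function and its simplifications are hypothetical, not the actual JobInProgress logic):

```python
def run_job(num_maps, failed_maps, failures_maxpercent, setup_cleanup_needed):
    """Toy model of the job completion decision described in the comment."""
    # job fails outright only if the failure percentage is exceeded
    if failed_maps * 100 > num_maps * failures_maxpercent:
        return "FAILED"
    if setup_cleanup_needed:
        # a cleanup task is launched; its completion drives the job to SUCCEEDED
        return "SUCCEEDED"
    # with cleanup disabled, jobComplete() is only reached from
    # completedTask(); a *failed* final TIP never invokes it -> hang
    last_tip_succeeded = (failed_maps == 0)
    return "SUCCEEDED" if last_tip_succeeded else "RUNNING (hung)"

# 100 maps, last TIP fails after 4 attempts, 5% of failures tolerated:
print(run_job(100, 1, 5, setup_cleanup_needed=False))  # RUNNING (hung)
print(run_job(100, 0, 5, setup_cleanup_needed=False))  # SUCCEEDED
```

        The tolerated failure (1 < 5) keeps the job out of the FAILED state, but nothing is left to drive it to SUCCEEDED.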
        Kang Xiao added a comment -

        One solution is to complete the job when (!jobSetupCleanupNeeded && canLaunchJobCleanupTask()) holds, in JobInProgress.failedTask().

        Patch attached for this solution.
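        As a rough illustration of this proposal (a toy Python model; the Job class, its field names, and the percentage check are simplifications, not the actual JobInProgress code), mirroring the jobComplete() call inside failedTask() lets the job finish:

```python
class Job:
    """Toy model: a job with map TIPs that either complete or fail."""

    def __init__(self, num_maps, failures_maxpercent, setup_cleanup_needed):
        self.num_maps = num_maps
        self.failures_maxpercent = failures_maxpercent
        self.setup_cleanup_needed = setup_cleanup_needed
        self.finished_tips = 0   # succeeded + failed TIPs
        self.failed_tips = 0
        self.state = "RUNNING"

    def can_launch_cleanup_task(self):
        # all TIPs have reached a terminal state
        return self.finished_tips == self.num_maps

    def completed_task(self):
        self.finished_tips += 1
        if not self.setup_cleanup_needed and self.can_launch_cleanup_task():
            self.state = "SUCCEEDED"   # jobComplete() in the real code

    def failed_task(self):
        self.finished_tips += 1
        self.failed_tips += 1
        if self.failed_tips * 100 > self.num_maps * self.failures_maxpercent:
            self.state = "FAILED"
            return
        # the proposed fix: mirror the jobComplete() call that
        # completedTask() makes, so a tolerated final failure still
        # completes the job
        if not self.setup_cleanup_needed and self.can_launch_cleanup_task():
            self.state = "SUCCEEDED"

job = Job(100, 5, setup_cleanup_needed=False)
for _ in range(99):
    job.completed_task()
job.failed_task()      # last TIP fails, within the tolerated 5%
print(job.state)       # SUCCEEDED instead of hanging in RUNNING
```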

        Subroto Sanyal added a comment -

        Hi Kang,
        Can you please provide a more specific test case for this problem?
        I tried the same scenario with 111 maps and 0 reduces (a modified WordCount), but the issue did not reproduce.
        The following configuration was added to mapred-site.xml:

        <property>
          <name>mapreduce.job.committer.setup.cleanup.needed</name>
          <value>false</value>
          <description>true, if job needs job-setup and job-cleanup.
                       false, otherwise</description>
        </property>

        <property>
          <name>mapreduce.map.failures.maxpercent</name>
          <value>5</value>
          <description></description>
        </property>
        Subroto Sanyal added a comment -

        Aaah... got the problem.
        The root cause of the problem is:
        Before scheduling reduces, a check is made in JobInProgress: boolean org.apache.hadoop.mapred.JobInProgress.scheduleReduces(), which makes the following check:
        finishedMapTasks >= completedMapsForReduceSlowstart
        This check is valid if mapred.max.map.failures.percent is set to 0 (zero, the default value), but when a non-zero value is set, the above-mentioned check is invalid.
        Say, for example, a job spawns 100 map tasks and mapred.max.map.failures.percent is set to 5 percent. In this scenario reducers should be scheduled even if only 95 maps are successful. Looking back at the check above, the condition will never be satisfied (if 5 map tasks fail), because 95 >= 100 is always false.

        As per my understanding, the issue has nothing to do with mapreduce.job.committer.setup.cleanup.needed.

        Kang,

        We can't call void org.apache.hadoop.mapred.JobInProgress.jobComplete() from void org.apache.hadoop.mapred.JobInProgress.failedTask(TaskInProgress tip, TaskAttemptID taskid, TaskStatus status, TaskTracker taskTracker, boolean wasRunning, boolean wasComplete, boolean wasAttemptRunning), as that method is called upon failure of a task; we need to wait until completedMapsForReduceSlowstart is reached, so that reduces are spawned.

        In my scenario there was a job (WordCount) with 111 mappers.
        The value of mapred.reduce.slowstart.completed.maps was set to 1 (100%), and mapreduce.map.failures.maxpercent was set to 5 (5%). The Mapper implementation was tweaked so that 4 mappers failed.
        After some time I noticed that 107 mappers had completed, but the reduces were not running and the job was stuck indefinitely.
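        The stuck check can be worked through numerically (illustrative Python; the variable names loosely mirror the JobInProgress fields mentioned above):

```python
# Values from the 111-map scenario described in the comment.
num_maps = 111
slowstart = 1.0     # mapred.reduce.slowstart.completed.maps = 1 (100%)
failed_maps = 4     # within the 5% tolerance: 4/111 is about 3.6%

# slowstart threshold, as (simplified) computed for scheduleReduces()
completed_maps_for_reduce_slowstart = int(slowstart * num_maps)   # 111

# only *successful* maps are counted as finished
finished_map_tasks = num_maps - failed_maps                       # 107

# the broken check: with any tolerated failures, this can never hold
print(finished_map_tasks >= completed_maps_for_reduce_slowstart)  # False
```

        With slowstart at 100%, the threshold is 111, but at most 107 maps can ever succeed, so the reducers are never scheduled.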

        Subroto Sanyal added a comment -

        Missed mentioning one point:
        the number of reducers doesn't impact the issue.
        In my test environment, the job had 2 reducers.

        Kang Xiao added a comment -

        Hi Subroto, sorry for the late response.

        1. The scheduleReduces() issue you found is really a separate issue, in which reduce tasks will not be scheduled.
        2. The original issue I reported requires that the last map fail ("the last map failed 4 attempts, then the last TIP failed"). If the last completed map succeeds, the job still goes to the SUCCEEDED state.

        Subroto Sanyal added a comment -

        Hi Kang,

        Can you please provide a test case?
        Please also provide the version in which you actually found the bug. As far as I know, the property mapreduce.job.committer.setup.cleanup.needed does not exist in version 0.20.1.

        yuling added a comment -

        Good patch, got it.
        mapreduce.job.committer.setup.cleanup.needed was added by MAPREDUCE-463.

        Arun C Murthy added a comment -

        Sorry to come in late, the patch has gone stale. Can you please rebase? Thanks.

        Subroto Sanyal added a comment -

        This fix is still applicable for 1.0 and 1.1.0.
        Requesting review of the patch.

        Subroto Sanyal added a comment -

        This issue is not applicable for trunk.

        Ravi Gummadi added a comment -

        In trunk, I saw a modified WordCount job hanging. I made 1 of the map tasks (out of 8) fail. The web UI shows that all 8 maps completed, 1 of which failed. The reduces are stuck at around 29% forever. It looks like they are waiting for the failed map's output? Related to this JIRA?

        Ravi Gummadi added a comment -

        OK. The issue I mentioned in my previous comment seems to be tracked at MAPREDUCE-4013.

        Ravi Prakash added a comment -

        Thanks Kang and Subroto for your patches. I've rebased it on branch-1.0.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12519970/MAPREDUCE-2129.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2100//console

        This message is automatically generated.

        Ravi Prakash added a comment -

        This patch applies to branch-1.0, not trunk. That's why Hadoop QA is complaining.

        Harsh J added a comment -

        This issue is not applicable for trunk.

        Can you also elaborate why, and how trunk has it fixed? Thanks!

        Also, can we have an additional, fuller e2e test (or perhaps extend the max-failures-percent tests to toggle setup/cleanup as well), just for completeness' sake?

        Jason Lowe added a comment -

        Trunk doesn't have this problem because JobImpl.getCompletedMaps() includes failed and killed tasks as well as succeeded tasks. RMContainerAllocator uses getCompletedMaps() to check against the slowstart threshold for launching reducers, so it is already accounting for map tasks that may not have succeeded.
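        As a sketch of that accounting difference (simplified Python; only the getCompletedMaps name comes from the comment, the rest is illustrative):

```python
def get_completed_maps(succeeded, failed, killed):
    # trunk-style accounting: a map counts as "completed" once it reaches
    # any terminal state, not only SUCCEEDED
    return succeeded + failed + killed

num_maps, slowstart = 100, 1.0
threshold = int(slowstart * num_maps)

# 95 succeeded + 5 tolerated failures still satisfies the slowstart check,
# so the reducers launch and the job can make progress:
print(get_completed_maps(95, 5, 0) >= threshold)  # True
```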

        Thomas Graves added a comment -

        Harsh, which tests specifically are you referring to? I'm not seeing any existing max-failures-percent tests in 1.0.

        The test case needs to be upmerged to the latest branch-2 and needs @Test in front of it so it actually runs. I'll see if I can add another test that actually uses the config settings.

        I've also tested this fix manually on a small cluster, and it works with this patch.

        Thomas Graves added a comment -

        typo: branch-2 should be branch-1.

        Thomas Graves added a comment -

        Added another test that sets the actual configs and runs a job, fixed the missing @Test, and upmerged to the latest branch-2.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12535250/MAPREDUCE-2129.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2545//console

        This message is automatically generated.

        Robert Joseph Evans added a comment -

        Tom, do you mean branch-1 or branch-2? The patch looks fine to me on branch-1. +1.

        Thomas Graves added a comment -

        I meant branch-1. Thanks Bobby!

        Thomas Graves added a comment -

        I've committed this to branch-1. It would be nice to pull this into the 1.1.0 release if possible.

        Matt Foley added a comment -

        Committed to branch-1.1 for 1.1.0.
        Fixed location in branch-1/CHANGES.txt.

        I sure hope they put back the SVN/JIRA integration plugin soon!

        Matt Foley added a comment -

        Closed upon release of Hadoop-1.1.0.


          People

          • Assignee: Subroto Sanyal
          • Reporter: Kang Xiao
          • Votes: 1
          • Watchers: 12