Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.6.0
    • Component/s: applicationmaster, mrv2
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      Found this while reviewing YARN-2074. The problem is that after YARN-2074, we don't count AM preemption towards AM failures on the RM side, but the MapReduce AM itself checks its attempt id against the max-attempt count to determine whether this is the last attempt.

          public void computeIsLastAMRetry() {
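            // Note: the raw attempt id counts preempted attempts too, which the
            // RM no longer counts as failures after YARN-2074.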
            isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
          }
      

      This causes issues with, for example, deletion of the staging directory.
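
      To make the mismatch concrete, here is a purely illustrative sketch (hypothetical numbers, not code from the patch): after YARN-2074 the RM excludes preempted attempts from its failure count, while attempt ids keep incrementing, so the check above can declare the last retry too early.

          // Purely illustrative numbers, showing the mismatch:
          int attemptId = 4;          // raw id of the current attempt
          int maxAppAttempts = 4;     // configured attempt limit
          int preemptedAttempts = 2;  // preempted, so not failures per YARN-2074

          // RM's view: only non-preempted attempts count, so retries remain.
          boolean rmAllowsRetry = (attemptId - preemptedAttempts) < maxAppAttempts; // true
          // AM's view via computeIsLastAMRetry(): wrongly concludes "last retry".
          boolean amThinksLastRetry = attemptId >= maxAppAttempts; // also true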

      Attachments

      1. MR-5956.patch, 20 kB, Wangda Tan
      2. MR-5956.patch, 14 kB, Wangda Tan

          Activity

          Vinod Kumar Vavilapalli added a comment -

          Given YARN-2074 is in, this is a blocker for 2.5, marking so..

          Wangda Tan added a comment -

          Assigned it to me; already started working on this.

          Karthik Kambatla (Inactive) added a comment -

          Wangda Tan - thanks for picking this up. Can we get to this soon? If you are caught up otherwise, I can take a stab.

          Vinod Kumar Vavilapalli added a comment -

          Here's what I think are the potential solutions and their problems:

          1. YARN informs AM that it is the last retry as part of AM start-up or the register API
          2. YARN informs the AM that this is the last retry as part of AM unregister
          3. YARN has a way to run a separate cleanup container after it knows for sure that the application finished exhausting all its attempts

          (1) is not really possible. At best, the RM can say 'mayBeTheLastAttempt'. So the AM cannot really assume that this is the last retry, and therefore cannot do things like cleaning up the staging directory.

          (2) is good enough for the successful code path. In fact, we already have a way of telling the AM that unregister succeeded and that this indeed is the last retry, so we don't need a new API. If the RM crashed or failed over before that, the app will get a new retry anyway. The downside of this approach is that there are many cases where the app's last retry may have crashed (say, from an OOM) and so doesn't clean up stale files. In fact, any solution that relies on such RM-AM communication will not really solve those corner cases.

          (3) is an acknowledgement that a solution to the problem of cleaning up stale files is not possible without explicit help from the RM. The more I think about it, the more it appears to me that this is the right solution. Filing a ticket, but this will take a while, so we may have to just do (2) for the time being.

          Vinod Kumar Vavilapalli added a comment -

          Filing a ticket, but this will take a while, so we may have to just do (2) for the time being.

          Filed YARN-2261.

          Hitesh Shah added a comment -

          Vinod Kumar Vavilapalli By definition, if an AM calls unregister, it is telling the RM that this is its last attempt and the app should not be retried. Are you now saying that all attempts should call unregisterAttempt(), which would tell the app whether it is the final attempt and should call a final unregister()? If not, I think something else is needed, as an AM will only call unregister() on an error if it thinks it is the last attempt.

          Vinod Kumar Vavilapalli added a comment -

          (AMRMClient|AMRMClientAsync).unregisterApplicationMaster() is a blocking call. If any attempt calls this API and it succeeds, this AM is the last retry and can go ahead and do its cleanup. All other attempts (which either don't call this API or fail before the API returns) do not need to do any cleanup. Of course, there are corner cases where this is not sufficient.

          For that and all the failing cases, the only comprehensive solution I can think of is YARN-2261.
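
          As a purely illustrative sketch of that flow against the standard AMRMClient API (the class name, staging path, and messages are hypothetical; this is not the MR AM's actual code):

              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;
              import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
              import org.apache.hadoop.yarn.client.api.AMRMClient;

              // Sketch: clean up only after the blocking unregister succeeds.
              public class UnregisterThenCleanup {
                public static void main(String[] args) throws Exception {
                  Configuration conf = new Configuration();
                  AMRMClient<AMRMClient.ContainerRequest> client = AMRMClient.createAMRMClient();
                  client.init(conf);
                  client.start();
                  try {
                    client.registerApplicationMaster("localhost", -1, "");
                    // ... run the application's work here ...

                    // Blocking: returns only once the RM has accepted the finish,
                    // i.e. no further attempt will be launched for this app.
                    client.unregisterApplicationMaster(
                        FinalApplicationStatus.SUCCEEDED, "done", null);

                    // Only now is it safe to delete attempt-shared state.
                    FileSystem fs = FileSystem.get(conf);
                    fs.delete(new Path("/user/someuser/.staging/job_0001"), true); // hypothetical path
                  } finally {
                    client.stop();
                  }
                }
              }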

          Wangda Tan added a comment -

          Thanks for the thoughts, Vinod Kumar Vavilapalli. I had an offline discussion with Vinod; posting a summary here.

          Basically, there are 3 cases that need cleanup:
          a. Job completed (failed or succeeded, no matter whether it's the last retry or not)
          b. Failure happened and was captured by MRAppMasterShutDownHook
          c. Failure happened and was not captured by MRAppMasterShutDownHook

          And for the thoughts Vinod provided:

          1. YARN informs AM that it is the last retry as part of AM start-up or the register API
          2. YARN informs the AM that this is the last retry as part of AM unregister
          3. YARN has a way to run a separate cleanup container after it knows for sure that the application finished exhausting all its attempts
          

          (1) can solve a. and part of b.
          Why only part of b? Because it is possible that MRAppMasterShutDownHook triggered but some other failure happened, leaving cleanup incomplete.
          (2) can only solve a.
          The reason is that if we don't have isLastRetry (or mayBeTheLastAttempt) properly set at register time, we don't know whether to do cleanup or not.
          (3) can solve a., b., and c.
          Refer to YARN-2261 for more details.

          I tried to work on (1) first; however, I found that moving the isLastRetry setup from MRAppMaster.init to RMCommunicator causes a lot of code changes and many unit test failures. So my suggestion is to quickly finish (2), which makes the job-completed case (the most common one) correct, and to push (3) forward.

          I'll upload a patch implementing (2) for review soon.

          Thanks,
          Wangda

          Zhijie Shen added a comment -

          b. Failure happened and was captured by MRAppMasterShutDownHook

          How can (2) work for b? Since the MR AM doesn't know about the preemption, the only possibility is that the MR AM thinks it's not the last retry but the RM thinks it is (the RM may also think it's not the last retry, just with one fewer counted attempt). In this case, the MR AM wants to get the correct last-retry flag from the RM. However, RMCommunicator is not supposed to unregister if the MR AM doesn't think it's the last retry. Hence I'm afraid the MR AM doesn't get a chance to communicate with the RM to obtain the right information, unless the logic that triggers unregistration is modified. Please correct me if I'm missing something.

          Zhijie Shen added a comment -

          Talked to Wangda offline. Basically, the current solution is not to compute isLastRetry at the beginning of the MR AM lifecycle, but to keep it false (except in some corner cases). In this way, in all of scenarios a, b, and c, the MR AM is going to get a retry upon failure, and the RM will decide whether the MR job still has a chance, with preemption taken into account. Therefore, the MR AM will never lose a retry chance it should have; the trade-off is that the staging dir is not going to be cleaned up at the real last retry, which will be taken care of by YARN-2261. It's now clear to me; thanks Wangda! And +1 for the plan.
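
          In rough sketch form, the shape of that approach looks something like this (hypothetical names; the committed patch differs in detail):

              // Sketch only, with hypothetical names.
              class LastRetrySketch {
                // Assume "not last" by default instead of deriving it from the
                // attempt id, which over-counts preempted attempts.
                private volatile boolean isLastAMRetry = false;

                void onJobFinish() {
                  // A successful (blocking) unregister means the RM will launch
                  // no further attempt, so only now is this provably the last retry.
                  unregisterFromRM();
                  isLastAMRetry = true;
                  cleanupStagingDir(); // safe: no later attempt needs these files
                }

                void unregisterFromRM() { /* RPC to the RM, omitted */ }
                void cleanupStagingDir() { /* delete the job staging dir, omitted */ }
              }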

          Wangda Tan added a comment -

          Attached a fix for this; please review,
          Vinod Kumar Vavilapalli, Zhijie Shen.

          Mayank Bansal added a comment -

          Hi Vinod Kumar Vavilapalli, Wangda Tan,

          I am a little confused here. Are we saying that we will retry the AM indefinitely until it succeeds and calls unregister? Am I missing something?
          AM failures could be due to preemption or could be legitimate failures.

          Thanks,
          Mayank

          Wangda Tan added a comment -

          Hi Mayank Bansal,
          To be precise: we will retry the AM indefinitely until it completes and calls unregister. AM completion includes various states like job failed/killed/internal-error, etc. More specifically, when JobFinishEventHandler in MRAppMaster receives a JobFinishEvent, it will call unregister/cleanup.
          Does this answer your question?

          Thanks,
          Wangda

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12654932/MR-5956.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4722//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4722//console

          This message is automatically generated.

          Mayank Bansal added a comment -

          Let's say the AM goes OOM and gets killed by the AM monitor; what would the behavior be?

          Thanks,
          Mayank

          Hitesh Shah added a comment -

          To add to Mayank Bansal's comment: suppose this is the 4th (and last) attempt of the AM and there have been no preemptions.

          Wangda Tan added a comment -

          Hi Mayank Bansal and Hitesh Shah,
          Currently, when killing a container, the YARN NM will first send SIGTERM, sleep for a while (default 250ms, set by yarn.nodemanager.sleep-delay-before-sigkill.ms), and then send SIGKILL if the process is still alive.
          The MR shutdown hook can catch SIGTERM. So in Hitesh's scenario, if the AM OOMs on the last retry and is killed by the NM ContainersMonitor, the AM will not do cleanup. If the AM is not the last attempt, it will be restarted by the RM.
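
          For reference, the JVM mechanism involved here is an ordinary shutdown hook, which runs on SIGTERM but can never run on SIGKILL; a minimal standalone illustration (not MR code):

              // Hooks must finish within the NM's kill delay
              // (yarn.nodemanager.sleep-delay-before-sigkill.ms, default 250ms),
              // so cleanup here is strictly best-effort.
              public class ShutdownHookDemo {
                public static void main(String[] args) throws InterruptedException {
                  Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                    System.out.println("SIGTERM received, doing best-effort cleanup...");
                  }));
                  Thread.sleep(60_000); // run `kill -TERM <pid>` to see the hook fire
                }
              }
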
          Thanks,
          Wangda

          Zhijie Shen added a comment -

          The patch looks good to me overall. Just a minor comment:

          For TestStagingCleanup you can remove "extends TestCase", and then remove @Test for the cases that you want to ignore.
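
          (For context, a hedged aside: when a class extends junit.framework.TestCase, JUnit runs it as a JUnit 3 suite, executing every public testXxx method regardless of annotations; dropping the superclass means only @Test-annotated methods run. Illustrative class name below:)

              import org.junit.Test;

              // Once the class no longer extends junit.framework.TestCase,
              // only @Test-annotated methods are executed.
              public class TestStagingCleanupStyle {
                @Test
                public void testRunsUnderJUnit4() { /* executed */ }

                public void testIgnoredWithoutAnnotation() { /* not executed */ }
              }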

          Wangda Tan added a comment -

          Zhijie, thanks for the review!
          Uploaded a patch that addresses your comment and removes the max-attempt field in MRAppMaster, since we don't use it now.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12655133/MR-5956.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4726//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4726//console

          This message is automatically generated.

          Zhijie Shen added a comment -

          +1, will commit the patch.

          Zhijie Shen added a comment -

          Committed to trunk and branch-2. Thanks, Wangda! Let's move on with YARN-2261 to fix the issue that the staging files are not cleaned up in the failure case.

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #5870 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5870/)
          MAPREDUCE-5956. Made MR AM not use maxAttempts to determine if the current attempt is the last retry. Contributed by Wangda Tan. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1609649)

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MRApp.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestJobEndNotifier.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRAppMaster.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java
          Wangda Tan added a comment -

          Thanks Zhijie Shen for review and commit!

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #610 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/610/)
          MAPREDUCE-5956. Made MR AM not use maxAttempts to determine if the current attempt is the last retry. Contributed by Wangda Tan. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1609649)

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MRApp.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestJobEndNotifier.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRAppMaster.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1828 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1828/)
          MAPREDUCE-5956. Made MR AM not use maxAttempts to determine if the current attempt is the last retry. Contributed by Wangda Tan. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1609649)

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MRApp.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestJobEndNotifier.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRAppMaster.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #1801 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1801/)
          MAPREDUCE-5956. Made MR AM not use maxAttempts to determine if the current attempt is the last retry. Contributed by Wangda Tan. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1609649)

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MRApp.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestJobEndNotifier.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRAppMaster.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java
          Jason Lowe added a comment -

          This is marked as a Blocker for 2.5.0 but fixed in 2.6.0. Should this be committed to branch-2.5 as well?

          Karthik Kambatla (Inactive) added a comment -

          Reopening this to include in 2.5.1.

          Zhijie Shen added a comment -

          @Jason, Vinod and I discussed which release the commit should go into, and we decided on 2.6. The reason is that the fix will lead to undeleted staging dirs. YARN-2261 is supposed to fix that, and we don't want an incomplete solution to enter a release. Thoughts?

          Ashwin Shankar added a comment -

          Zhijie Shen, could you please elaborate on when and why this fix will lead to undeleted staging dirs?
          In such cases, would the job go missing on the history server? Any other impact?
          I'm asking because we are backporting this patch into our release.

          Wangda Tan added a comment -

          Hi Ashwin Shankar,
          You can check my comment https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14056026&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14056026. And Zhijie's comment: https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14057067&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14057067.

          In such cases, would the job go missing on the history server? Any other impact?

          I think the job will not go missing on the JHS; no other impact comes to mind right now.
          Please let me know if you have any other questions.

          Thanks,
          Wangda

          Zhijie Shen added a comment -

          Wangda Tan, thanks for your comments. I'd like to add some additional points.

          • An undeleted staging dir is a very rare issue, though it can happen. The MR AM will ask for a retry (with the staging dir kept for the next retry) whenever a failure happens, regardless of what maxAttempts is. If the RM then finds the job has used up all its retry quota, the staging dir will not be cleaned up.
          • The undeleted staging dir won't affect JHS or other MR logic, but you need to evaluate whether it will affect your business. Or you can wait until YARN-2261 is ready, but the downside is that the MR job may have a few fewer retry opportunities than it expects to have.
          Jason Lowe added a comment -

          When it comes to leaking staging directories, there are far more common cases where that occurs than this scenario. e.g.: application killed before AM starts or in-between AM retries, AM is misconfigured and fails every time, etc. It seems like the scenario we're worried about is highly unlikely, so I don't think it'd be a big deal to put into 2.5.1 from that standpoint. IIRC the problem being fixed here isn't an issue unless preemption is a factor. If we're officially supporting preemption in 2.5 then I think this is a good candidate to consider for 2.5.1, otherwise it can wait until 2.6.

          Zhijie Shen added a comment -

          When it comes to leaking staging directories, there are far more common cases where that occurs than this scenario. e.g.: application killed before AM starts or in-between AM retries, AM is misconfigured and fails every time, etc. It seems like the scenario we're worried about is highly unlikely, so I don't think it'd be a big deal to put into 2.5.1 from that standpoint.

          Hm... It makes sense to me. Vinod Kumar Vavilapalli, what do you think?

          Karthik Kambatla (Inactive) added a comment -

          Spoke to Zhijie Shen offline. Looks like we can leave it out of 2.5.1.


            People

            • Assignee: Wangda Tan
            • Reporter: Vinod Kumar Vavilapalli
            • Votes: 0
            • Watchers: 14
