Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-5567 [Umbrella] Stabilize MR framework w.r.t ResourceManager restart
  3. MAPREDUCE-5476

Job can fail when RM restarts after staging dir is cleaned but before MR successfully unregister with RM

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.1-beta
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed
    1. MAPREDUCE-5476.1.patch
      9 kB
      Jian He
    2. MAPREDUCE-5476.2.patch
      9 kB
      Jian He
    3. MAPREDUCE-5476.patch
      6 kB
      Jian He
    4. YARN-917.patch
      4 kB
      Jian He

      Issue Links

        Activity

        Hide
        jlowe Jason Lowe added a comment -

        I think one way to solve this is to move the removal of the staging directory to after we unregister from the RM. Now that there's a FINISHING state that gives the app a grace period to finish cleanly, we leverage this to remove the staging directory after unregistering. This should solve some other races related to removal of the staging directory and unregistering (e.g.: AM crashes after removing staging directory but before unregistering).

        Show
        jlowe Jason Lowe added a comment - I think one way to solve this is to move the removal of the staging directory to after we unregister from the RM. Now that there's a FINISHING state that gives the app a grace period to finish cleanly, we leverage this to remove the staging directory after unregistering. This should solve some other races related to removal of the staging directory and unregistering (e.g.: AM crashes after removing staging directory but before unregistering).
        Hide
        jianhe Jian He added a comment -

        The above solution looks good. thinking another way: Does it make sense just to move "addService(createStagingDirCleaningService());" before "addIfService(containerAllocator);" inside MRAppMaster ? In this way, when MR stops, cleanup staging dir will be called after unregister. This looks simpler.

        Show
        jianhe Jian He added a comment - The above solution looks good. thinking another way: Does it make sense just to move "addService(createStagingDirCleaningService());" before "addIfService(containerAllocator);" inside MRAppMaster ? In this way, when MR stops, cleanup staging dir will be called after unregister. This looks simpler.
        Hide
        jlowe Jason Lowe added a comment -

        Yes, that's exactly what I was proposing with my first comment.

        Originally the staging directory cleanup was after the unregister, but there was a problem. Before the FINISHING state was added, the RM would kill the AM container as soon as it unregistered. This meant that sometimes the AM would be killed by the NM (if the NM happened to heartbeat soon enough) before the AM had a chance to cleanup the staging directory, and over time staging directories would start piling up and filling user quotas. The initial fix was to move the staging directory cleanup from after unregistering to just before.

        Now that there's a FINISHING state that allows the AM some time to cleanup before it is killed, it should have plenty of time to remove the staging directory before the RM tries to kill it after unregistering.

        Show
        jlowe Jason Lowe added a comment - Yes, that's exactly what I was proposing with my first comment. Originally the staging directory cleanup was after the unregister, but there was a problem. Before the FINISHING state was added, the RM would kill the AM container as soon as it unregistered. This meant that sometimes the AM would be killed by the NM (if the NM happened to heartbeat soon enough) before the AM had a chance to cleanup the staging directory, and over time staging directories would start piling up and filling user quotas. The initial fix was to move the staging directory cleanup from after unregistering to just before. Now that there's a FINISHING state that allows the AM some time to cleanup before it is killed, it should have plenty of time to remove the staging directory before the RM tries to kill it after unregistering.
        Hide
        jianhe Jian He added a comment -

        upload a patch that reverse the order of unregister and clean staging dir.
        Will do manual test

        Show
        jianhe Jian He added a comment - upload a patch that reverse the order of unregister and clean staging dir. Will do manual test
        Hide
        jlowe Jason Lowe added a comment -

        Patch looks OK to me. Shouldn't be too hard to write a unit test in TestStagingCleanup that wraps the container allocator in a service and verifies when the service is shutdown that the staging directory has not been deleted but is deleted once the application stops.

        Also this JIRA really should be in MAPREDUCE.

        Show
        jlowe Jason Lowe added a comment - Patch looks OK to me. Shouldn't be too hard to write a unit test in TestStagingCleanup that wraps the container allocator in a service and verifies when the service is shutdown that the staging directory has not been deleted but is deleted once the application stops. Also this JIRA really should be in MAPREDUCE.
        Hide
        jianhe Jian He added a comment -

        New patch add tests as Jason suggested.

        Show
        jianhe Jian He added a comment - New patch add tests as Jason suggested.
        Hide
        jianhe Jian He added a comment -

        Did manual test on single node cluster.
        1. Put a long sleep before the unregister call of RMCommunicator.
        2. Kill the AM while it's sleeping and restart the RM.

        Without the patch, the following restarted AMs (up to max-num-am-retry) will fail since the staging dir has already been removed.
        With the patch, the restarted AM is able to continue and succeed.

        Show
        jianhe Jian He added a comment - Did manual test on single node cluster. 1. Put a long sleep before the unregister call of RMCommunicator. 2. Kill the AM while it's sleeping and restart the RM. Without the patch, the following restarted AMs (up to max-num-am-retry) will fail since the staging dir has already been removed. With the patch, the restarted AM is able to continue and succeed.
        Hide
        hadoopqa Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12599281/MAPREDUCE-5476.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3953//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3953//console

        This message is automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12599281/MAPREDUCE-5476.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3953//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3953//console This message is automatically generated.
        Hide
        vinodkv Vinod Kumar Vavilapalli added a comment -

        The approach sounds fine to me. This looks like a clear follow up from MAPREDUCE-4819 that we never pursued.

        OTOH, though I didn't always feel comforable with the FINISHING state, things like these prove its practical wins

        Re the patch:

        • The java comments needs to be slightly improved to reflect what we are really doing
        • Test:
          • Rename cleanupStagingDir to stagingDirCleanedUp.
          • Not related to your patch but can you also modify the test a little to make sure the JobHistoryEventHandler is the first thing to shut-down?
        Show
        vinodkv Vinod Kumar Vavilapalli added a comment - The approach sounds fine to me. This looks like a clear follow up from MAPREDUCE-4819 that we never pursued. OTOH, though I didn't always feel comforable with the FINISHING state, things like these prove its practical wins Re the patch: The java comments needs to be slightly improved to reflect what we are really doing Test: Rename cleanupStagingDir to stagingDirCleanedUp. Not related to your patch but can you also modify the test a little to make sure the JobHistoryEventHandler is the first thing to shut-down?
        Hide
        jianhe Jian He added a comment -

        Upload a new patch.

        Fixed the code comments and rename as suggested.
        Changed the test case to test that services stops in the order of JobHistoryEventHandler, ContainerAllocator and StagingDirCleaningService

        Show
        jianhe Jian He added a comment - Upload a new patch. Fixed the code comments and rename as suggested. Changed the test case to test that services stops in the order of JobHistoryEventHandler, ContainerAllocator and StagingDirCleaningService
        Hide
        jianhe Jian He added a comment -

        New patch fixed a typo

        Show
        jianhe Jian He added a comment - New patch fixed a typo
        Hide
        vinodkv Vinod Kumar Vavilapalli added a comment -

        That's a nice little test you wrote there!

        +1, will commit this once Jenkins says okay..

        Show
        vinodkv Vinod Kumar Vavilapalli added a comment - That's a nice little test you wrote there! +1, will commit this once Jenkins says okay..
        Hide
        hadoopqa Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12599488/MAPREDUCE-5476.2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3955//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3955//console

        This message is automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12599488/MAPREDUCE-5476.2.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3955//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3955//console This message is automatically generated.
        Hide
        vinodkv Vinod Kumar Vavilapalli added a comment -

        Committed this to trunk, branch-2 and branch-2.1. Thanks Jian!

        Show
        vinodkv Vinod Kumar Vavilapalli added a comment - Committed this to trunk, branch-2 and branch-2.1. Thanks Jian!
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #4313 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4313/)
        MAPREDUCE-5476. Changed MR AM recovery code to cleanup staging-directory only after unregistering from the RM. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1516660)

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-trunk-Commit #4313 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4313/ ) MAPREDUCE-5476 . Changed MR AM recovery code to cleanup staging-directory only after unregistering from the RM. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1516660 ) /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #310 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/310/)
        MAPREDUCE-5476. Changed MR AM recovery code to cleanup staging-directory only after unregistering from the RM. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1516660)

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Yarn-trunk #310 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/310/ ) MAPREDUCE-5476 . Changed MR AM recovery code to cleanup staging-directory only after unregistering from the RM. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1516660 ) /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Hdfs-trunk #1500 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1500/)
        MAPREDUCE-5476. Changed MR AM recovery code to cleanup staging-directory only after unregistering from the RM. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1516660)

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Hdfs-trunk #1500 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1500/ ) MAPREDUCE-5476 . Changed MR AM recovery code to cleanup staging-directory only after unregistering from the RM. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1516660 ) /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #1527 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1527/)
        MAPREDUCE-5476. Changed MR AM recovery code to cleanup staging-directory only after unregistering from the RM. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1516660)

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk #1527 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1527/ ) MAPREDUCE-5476 . Changed MR AM recovery code to cleanup staging-directory only after unregistering from the RM. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1516660 ) /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java

          People

          • Assignee:
            jianhe Jian He
            Reporter:
            jianhe Jian He
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development