Hadoop YARN / YARN-5136

Error in handling event type APP_ATTEMPT_REMOVED to the scheduler

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.1
    • Fix Version/s: 2.9.0, 3.0.0-alpha2
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Moving an application causes the RM to exit.

      2016-05-24 23:20:47,202 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
      java.lang.IllegalStateException: Given app to remove org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt@ea94c3b does not exist in queue [root.bdp_xx.bdp_mart_xx_formal, demand=<memory:28672000, vCores:14000>, running=<memory:28647424, vCores:13422>, share=<memory:28672000, vCores:0>, w=<memory weight=1.0, cpu weight=1.0>]
          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:119)
          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:779)
          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1231)
          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:114)
          at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:680)
          at java.lang.Thread.run(Thread.java:745)
      2016-05-24 23:20:47,202 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e04_1464073905025_15410_01_001759 Container Transitioned from ACQUIRED to RELEASED
      2016-05-24 23:20:47,202 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
      
      1. YARN-5136.2.patch
        8 kB
        Wilfred Spiegelenburg
      2. YARN-5136.1.patch
        7 kB
        Wilfred Spiegelenburg

          Activity

          wilfreds Wilfred Spiegelenburg added a comment -

          Hi tangshangwen do you mind if I assign this to myself? I have just run into the same issue and would like to provide a fix for this.

          tangshangwen tangshangwen added a comment -

          Wilfred Spiegelenburg ok

          wilfreds Wilfred Spiegelenburg added a comment -

          I was thrown off track a bit by all the changes that were made to the locking in the scheduler in YARN-3139.

          Analysis shows that the issue is not resolved yet, and that two situations can cause the above-mentioned problem:

          1. If a call to removeApplicationAttempt and a call to moveApplication for the same attempt are processed in that order in short succession, the application attempt will still contain a queue reference but is already removed from the list of applications for the queue.
          2. If two calls to removeApplicationAttempt come in in short succession, the application will still contain a queue reference but is already removed from the list of applications for the queue.

          In both cases the 2nd call must come in before the removeApplication call is made.
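
          For illustration, the second situation can be modeled in a few lines. This is a minimal sketch, not the Hadoop code: SketchQueue, SketchAttempt and DoubleRemovalSketch are hypothetical stand-ins, and the only behavior assumed is what the stack trace shows, namely that removing an app that is not in the queue raises an IllegalStateException.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a leaf queue; not the Hadoop FSLeafQueue class.
class SketchQueue {
  private final List<SketchAttempt> runnableApps = new ArrayList<>();

  void addApp(SketchAttempt app) {
    runnableApps.add(app);
  }

  // Mirrors the failure mode in the stack trace: removing an app that is
  // not in the list is treated as an illegal state.
  void removeApp(SketchAttempt app) {
    if (!runnableApps.remove(app)) {
      throw new IllegalStateException(
          "Given app to remove " + app + " does not exist in queue");
    }
  }
}

// Hypothetical stand-in for an application attempt.
class SketchAttempt {
  // The attempt keeps pointing at its queue even after it was removed.
  final SketchQueue queue;

  SketchAttempt(SketchQueue queue) {
    this.queue = queue;
  }
}

class DoubleRemovalSketch {
  // Handles one "attempt removed" event with no guard against repeats.
  static void handleAttemptRemoved(SketchAttempt attempt) {
    attempt.queue.removeApp(attempt);
  }

  // Returns true when the second event fails with IllegalStateException.
  static boolean secondRemovalFails() {
    SketchQueue queue = new SketchQueue();
    SketchAttempt attempt = new SketchAttempt(queue);
    queue.addApp(attempt);
    handleAttemptRemoved(attempt); // first event succeeds
    try {
      handleAttemptRemoved(attempt); // second event finds nothing to remove
      return false;
    } catch (IllegalStateException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("second removal fails: " + secondRemovalFails());
  }
}
```

          In the sketch the unchecked exception is caught by the caller; in the log above it reaches the scheduler event dispatcher, which is why the RM exits.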

          wilfreds Wilfred Spiegelenburg added a comment -

          Patch to prevent a double removal and a move after removal.
          It also changes an IllegalStateException for a checked case to a YarnException, so it does not take the RM down for that case.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 17s   Docker mode activated.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          +1    test4tests  0m 0s    The patch appears to include 1 new or modified test files.
          +1    mvninstall  6m 44s   trunk passed
          +1    compile     0m 33s   trunk passed
          +1    checkstyle  0m 23s   trunk passed
          +1    mvnsite     0m 38s   trunk passed
          +1    mvneclipse  0m 17s   trunk passed
          +1    findbugs    0m 58s   trunk passed
          +1    javadoc     0m 22s   trunk passed
          +1    mvninstall  0m 31s   the patch passed
          +1    compile     0m 30s   the patch passed
          +1    javac       0m 30s   the patch passed
          +1    checkstyle  0m 21s   the patch passed
          +1    mvnsite     0m 36s   the patch passed
          +1    mvneclipse  0m 14s   the patch passed
          +1    whitespace  0m 0s    The patch has no whitespace issues.
          +1    findbugs    1m 3s    the patch passed
          +1    javadoc     0m 19s   the patch passed
          -1    unit        42m 10s  hadoop-yarn-server-resourcemanager in the patch failed.
          +1    asflicense  0m 17s   The patch does not generate ASF License warnings.
                            57m 30s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart
            hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue YARN-5136
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12839216/YARN-5136.1.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 3419048b5355 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / b8690a9
          Default Java 1.8.0_101
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/13941/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13941/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/13941/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          wilfreds Wilfred Spiegelenburg added a comment -

          The TestRMRestart#testFinishedAppRemovalAfterRMRestart failure was logged as YARN-5362, which is closed as resolved. It looks like that change has not fixed it completely. Maybe a follow-up needs to be logged for that.
          The TestTokenClientRMService#testCancelWithMultipleAppSubmissions failure is tracked in YARN-5816 and is not caused by this change.

          Both tests pass in my local testing.

          wilfreds Wilfred Spiegelenburg added a comment -

          Opened YARN-5895 to track the new failure in TestRMRestart#testFinishedAppRemovalAfterRMRestart

          templedf Daniel Templeton added a comment -

          Thanks for the patch. It looks to me like it might be better to throw a YarnException in moveApplication() rather than just returning the current queue's name. The exception gets swallowed by the transition, so it shouldn't hurt anything, and it feels like the more natural path, rather than pretending that everything's OK. Also, your tests don't explicitly test anything. I get that you're just seeing if the operation blows up, but it would be nice to do some additional confirmation, such as checking that the app is still in the original queue.

          wilfreds Wilfred Spiegelenburg added a comment -

          Updated the patch to address the review comments:

          • added state checks in the tests
          • changed the return to a throw if the app was stopped before the move
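
          The two guards discussed in this thread can be sketched roughly as follows. This is an illustration under stated assumptions, not the committed change: the per-attempt "stopped" flag, GuardedAttempt, GuardedScheduler and SketchYarnException (a checked exception standing in for YarnException) are all hypothetical names. A duplicate removal is ignored, and a move of a stopped attempt is rejected with a checked exception that the caller can handle.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical checked exception standing in for YarnException.
class SketchYarnException extends Exception {
  SketchYarnException(String message) {
    super(message);
  }
}

// Hypothetical attempt that remembers its queue and whether it was stopped.
class GuardedAttempt {
  String queueName;
  boolean stopped;

  GuardedAttempt(String queueName) {
    this.queueName = queueName;
  }
}

// Hypothetical scheduler holding a list of attempts per queue name.
class GuardedScheduler {
  private final Map<String, List<GuardedAttempt>> queues = new HashMap<>();

  private List<GuardedAttempt> appsIn(String queueName) {
    return queues.computeIfAbsent(queueName, k -> new ArrayList<>());
  }

  void addAttempt(GuardedAttempt attempt) {
    appsIn(attempt.queueName).add(attempt);
  }

  // Returns false and does nothing when the attempt was already stopped,
  // so a duplicate removal event is ignored instead of failing.
  boolean removeAttempt(GuardedAttempt attempt) {
    if (attempt.stopped) {
      return false;
    }
    appsIn(attempt.queueName).remove(attempt);
    attempt.stopped = true;
    return true;
  }

  // Refuses to move a stopped attempt. The check runs before any state is
  // changed, so the attempt keeps its original queue name on failure.
  void moveAttempt(GuardedAttempt attempt, String targetQueue)
      throws SketchYarnException {
    if (attempt.stopped) {
      throw new SketchYarnException("Attempt is stopped and can't be moved");
    }
    appsIn(attempt.queueName).remove(attempt);
    attempt.queueName = targetQueue;
    addAttempt(attempt);
  }
}

class GuardSketchDemo {
  public static void main(String[] args) {
    GuardedScheduler scheduler = new GuardedScheduler();
    GuardedAttempt attempt = new GuardedAttempt("root.a");
    scheduler.addAttempt(attempt);
    System.out.println(scheduler.removeAttempt(attempt)); // first removal
    System.out.println(scheduler.removeAttempt(attempt)); // duplicate, ignored
    try {
      scheduler.moveAttempt(attempt, "root.b");
    } catch (SketchYarnException e) {
      System.out.println("move rejected: " + e.getMessage());
    }
  }
}
```

          Throwing from the move, rather than returning the current queue name, makes the rejected request visible to the caller instead of reporting a move that never happened.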
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 12s   Docker mode activated.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          +1    test4tests  0m 0s    The patch appears to include 1 new or modified test files.
          +1    mvninstall  7m 29s   trunk passed
          +1    compile     0m 35s   trunk passed
          +1    checkstyle  0m 24s   trunk passed
          +1    mvnsite     0m 40s   trunk passed
          +1    mvneclipse  0m 18s   trunk passed
          +1    findbugs    1m 2s    trunk passed
          +1    javadoc     0m 21s   trunk passed
          +1    mvninstall  0m 33s   the patch passed
          +1    compile     0m 32s   the patch passed
          +1    javac       0m 32s   the patch passed
          +1    checkstyle  0m 21s   the patch passed
          +1    mvnsite     0m 39s   the patch passed
          +1    mvneclipse  0m 15s   the patch passed
          +1    whitespace  0m 0s    The patch has no whitespace issues.
          +1    findbugs    1m 10s   the patch passed
          +1    javadoc     0m 20s   the patch passed
          -1    unit        38m 57s  hadoop-yarn-server-resourcemanager in the patch failed.
          +1    asflicense  0m 21s   The patch does not generate ASF License warnings.
                            55m 24s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue YARN-5136
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12841441/YARN-5136.2.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux a3e90b777743 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / c87b3a4
          Default Java 1.8.0_111
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/14157/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14157/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/14157/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          templedf Daniel Templeton added a comment -

          Thanks for the update, Wilfred Spiegelenburg. Looks like the move test isn't testing the app's queue after the move yet.

          wilfreds Wilfred Spiegelenburg added a comment -

          That test finishes with an exception being thrown (and caught, as declared in the test's expected exception), and because of that no update of the application object happens. Based on that, testing the application after the exception is thrown does not make sense to me, so I left it out. If the application is changed, there will be no exception and the test fails.
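
          The reasoning above can be sketched without a test framework. This is a hedged illustration only: SketchApp, StoppedAppException and MoveTestSketch are hypothetical names, and the sketch assumes the stopped check runs before any state is modified. Under that assumption, an exception implies an unchanged app, and a changed app implies no exception, so the check on the exception alone is enough.

```java
// Hypothetical stand-ins; not the Hadoop test or scheduler classes.
class StoppedAppException extends Exception {
  StoppedAppException(String message) {
    super(message);
  }
}

class SketchApp {
  String queueName = "root.queue1";
  boolean stopped;
}

class MoveTestSketch {
  // The check runs before any state is touched, so a throw implies
  // that the app's queue is unchanged.
  static void move(SketchApp app, String targetQueue)
      throws StoppedAppException {
    if (app.stopped) {
      throw new StoppedAppException("app is stopped and can't be moved");
    }
    app.queueName = targetQueue;
  }

  // Returns true only when the move of a stopped app throws.
  static boolean moveOfStoppedAppThrows() {
    SketchApp app = new SketchApp();
    app.stopped = true;
    try {
      move(app, "root.queue2");
      return false; // the app was changed, so the test would fail
    } catch (StoppedAppException expected) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(moveOfStoppedAppThrows());
  }
}
```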

          templedf Daniel Templeton added a comment -

          Yep, good point. +1 on the latest patch. I'll commit shortly.

          templedf Daniel Templeton added a comment -

          Thanks for the patch, Wilfred Spiegelenburg! Committed to trunk and branch-2.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10961 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10961/)
          YARN-5136. Error in handling event type APP_ATTEMPT_REMOVED to the (templedf: rev 9f5d2c4fff6d31acc8b422b52462ef4927c4eea1)

          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
          wilfreds Wilfred Spiegelenburg added a comment -

          Thank you Daniel Templeton for the review and commit

          djp Junping Du added a comment -

          This patch went into branch-2 only, not branch-2.8, so I set 2.9 as the fix version.


            People

            • Assignee:
              wilfreds Wilfred Spiegelenburg
              Reporter:
              tangshangwen tangshangwen
            • Votes:
              0
              Watchers:
              11
