Hadoop YARN / YARN-3614

FileSystemRMStateStore throws an exception when it fails to remove an application, which causes the resourcemanager to crash

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.5.0, 2.7.0
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:

      Description

      FileSystemRMStateStore is only an auxiliary plug-in for the rmstore.
      When it fails to remove an application, I think a warning is enough, but currently the resourcemanager crashes.

      Recently, I configured "yarn.resourcemanager.state-store.max-completed-applications" to limit the number of applications in the rmstore. When the number of applications exceeds the limit, some old applications are removed. If the removal fails, the resourcemanager crashes.
      The following is the log:

      2015-05-11 06:58:43,815 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1430994493305_0053
      2015-05-11 06:58:43,815 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Removing info for app: application_1430994493305_0053 at: /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
      2015-05-11 06:58:43,816 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error removing app: application_1430994493305_0053
      java.lang.Exception: Failed to delete /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
      at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
      at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
      at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
      at java.lang.Thread.run(Thread.java:745)
      2015-05-11 06:58:43,819 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
      java.lang.Exception: Failed to delete /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
      at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
      at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
      at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
      at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
      at java.lang.Thread.run(Thread.java:745)

        Activity

        ozawa Tsuyoshi Ozawa added a comment -

        lachisis, thank you for reporting this issue. I think it is resolved by the operation-level retry in FSRMStateStore implemented in YARN-2820. That feature was merged in 2.7.0. I think 2.7.1 is coming soon, so could you use it for your development?

        lachisis lachisis added a comment -

        Thanks for your attention.
        I have downloaded 2.7.0 and reviewed the FileSystemRMStateStore.java implementation.
        But I don't think it fixes the issue I submitted.

        The following is the code from 2.7.0. If "fs.delete" returns false, it still throws an Exception. I think a warning is enough here. Otherwise, if someone moves the application folder manually, the Exception will propagate through the functions "deleteFile", "deleteFileWithRetries" and "removeApplicationStateInternal".

          @Override
          public synchronized void removeApplicationStateInternal(
              ApplicationStateData appState)
              throws Exception {
            ApplicationId appId =
                appState.getApplicationSubmissionContext().getApplicationId();
            Path nodeRemovePath = getAppDir(rmAppRoot, appId);
            LOG.info("Removing info for app: " + appId + " at: " + nodeRemovePath);
            deleteFileWithRetries(nodeRemovePath);
          }

          private void deleteFileWithRetries(final Path deletePath) throws Exception {
            new FSAction<Void>() {
              @Override
              public Void run() throws Exception {
                deleteFile(deletePath);
                return null;
              }
            }.runWithRetries();
          }

          private void deleteFile(Path deletePath) throws Exception {
            if (!fs.delete(deletePath, true)) {
              throw new Exception("Failed to delete " + deletePath);
            }
          }

        ozawa Tsuyoshi Ozawa added a comment -

        Thank you for the clarification. In YARN-3410, which targets 2.8.0, the problem appears to be addressed, since removeApplication checks for the existence of the directory. Please correct me if I'm wrong.

          @Override
          public synchronized void removeApplication(ApplicationId removeAppId)
              throws Exception {
            Path nodeRemovePath = getAppDir(rmAppRoot, removeAppId);
            if (existsWithRetries(nodeRemovePath)) {
              deleteFileWithRetries(nodeRemovePath);
            }
          }
        
        rohithsharma Rohith Sharma K S added a comment -

        YARN-3410 removes an application from the RMStateStore through an RM start-up argument, i.e. ./yarn resourcemanager -remove-application-from-state-store <appId>.

        I am wondering about the use case: why would someone move the application folder manually? OTOH, it is better to either check for path existence or handle the exception and log a WARN message, instead of throwing an exception that crashes the RM.
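
One way to read the suggestion above (handle the failure and log a WARN instead of crashing the RM) is sketched below in plain Java. `DeleteOnlyFs` and `LenientDelete` are hypothetical names made up for this illustration, not Hadoop classes. The interface only mimics the shape of the `fs.delete(path, true)` call quoted earlier in this thread, and `java.util.logging` stands in for the RM's logger. This is not the patch that was eventually committed.

```java
import java.io.IOException;
import java.util.logging.Logger;

/** Hypothetical stand-in for the one file system call used here; not a Hadoop type. */
interface DeleteOnlyFs {
  boolean delete(String path, boolean recursive) throws IOException;
}

class LenientDelete {
  private static final Logger LOG = Logger.getLogger(LenientDelete.class.getName());

  /**
   * Tries to delete the path recursively. A false return value or an
   * IOException is logged as a warning and reported to the caller as
   * false, instead of being thrown further up the call chain.
   */
  static boolean deleteOrWarn(DeleteOnlyFs fs, String path) {
    try {
      if (fs.delete(path, true)) {
        return true;
      }
      LOG.warning("Failed to delete " + path + "; continuing");
    } catch (IOException e) {
      LOG.warning("Error while deleting " + path + ": " + e.getMessage());
    }
    return false;
  }

  public static void main(String[] args) {
    // A file system whose delete always reports failure (illustrative path).
    DeleteOnlyFs alwaysFails = (path, recursive) -> false;
    System.out.println(deleteOrWarn(alwaysFails, "/example/rmstore/app_1"));
  }
}
```

The trade-off of this approach is that a genuine storage failure is also downgraded to a warning, which is why the thread goes on to prefer an explicit existence check.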

        rohithsharma Rohith Sharma K S added a comment -

        lachisis, would you be interested in providing a patch? Feel free to take it up!

        ozawa Tsuyoshi Ozawa added a comment -

        @Rohith FSRMStateStore already checks path existence before removing the path. Am I missing something?

        @lachisis I would appreciate it if you could provide a patch.

        rohithsharma Rohith Sharma K S added a comment -

        Some methods do not check for path existence, such as removeApplicationStateInternal, removeRMDelegationTokenState and removeRMDTMasterKeyState. Am I right?

        ozawa Tsuyoshi Ozawa added a comment -

        Rohith Sharma K S, thank you for the clarification, I get the point. You're right.

        lachisis, do you have a chance to create a patch that does the following?

        • Create a helper method like "checkAndRemovePathWithRetries()", which calls existsWithRetries and deleteFileWithRetries internally.
        • Update the call sites in the files to use checkAndRemovePathWithRetries().
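
The helper proposed in the list above could look roughly like the following self-contained sketch. Everything here is an assumption made for illustration: `StoreFs`, `InMemoryStoreFs` and `CheckAndRemove` are hypothetical names, the plain loop is a simplified stand-in for the `FSAction.runWithRetries()` mechanism quoted earlier (with no sleep between attempts), and paths are plain strings rather than Hadoop `Path` objects.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical two-method stand-in for the file system calls the store needs. */
interface StoreFs {
  boolean exists(String path) throws IOException;
  boolean delete(String path, boolean recursive) throws IOException;
}

/** Tiny in-memory implementation, for illustration only. */
class InMemoryStoreFs implements StoreFs {
  final Set<String> paths = new HashSet<>();
  public boolean exists(String path) { return paths.contains(path); }
  public boolean delete(String path, boolean recursive) { return paths.remove(path); }
}

class CheckAndRemove {
  /**
   * Deletes the path only if it exists, making up to maxRetries extra
   * attempts when exists/delete throws or delete reports failure.
   * Returns true if something was deleted, false if the path was absent.
   */
  static boolean checkAndRemovePathWithRetries(StoreFs fs, String path, int maxRetries) {
    if (maxRetries < 0) {
      throw new IllegalArgumentException("maxRetries must be >= 0");
    }
    IOException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        if (!fs.exists(path)) {
          return false; // already gone: nothing to do, not an error
        }
        if (fs.delete(path, true)) {
          return true;
        }
        last = new IOException("Failed to delete " + path);
      } catch (IOException e) {
        last = e; // remember the failure and try again
      }
    }
    // All attempts failed while the path still existed.
    throw new UncheckedIOException(last);
  }

  public static void main(String[] args) {
    InMemoryStoreFs fs = new InMemoryStoreFs();
    fs.paths.add("/example/rmstore/app_1"); // illustrative path
    System.out.println(checkAndRemovePathWithRetries(fs, "/example/rmstore/app_1", 2));
    // Second call finds nothing to delete and returns false instead of throwing.
    System.out.println(checkAndRemovePathWithRetries(fs, "/example/rmstore/app_1", 2));
  }
}
```

With this shape, a directory that was moved or removed out of band is treated as already deleted, while a path that exists but cannot be deleted still surfaces as an error after the retries are used up.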
        ozawa Tsuyoshi Ozawa added a comment -

        Regarding the name "checkAndRemovePathWithRetries": personally, I think "checkAndDeleteFileWithRetries" would be more consistent.

        lachisis lachisis added a comment -

        Yes, it is ok to check the existence of the directory first.

        lachisis lachisis added a comment -

        Sorry, terrible network. How can I delete the repeated replies?

        lachisis lachisis added a comment -

        I use YARN HA for stable service.
        Months later, I found that when the standby resourcemanager tries to transition to active, it takes more than ten minutes to load applications. So I backed up the rmstore in HDFS and changed the configuration "yarn.resourcemanager.state-store.max-completed-applications" to limit the number of applications in the rmstore. The transition then worked well.
        Later my partner restored the backed-up rmstore and submitted a new application, and the resourcemanager crashed.

        I know that restoring a backed-up rmstore while the resourcemanager is running is not appropriate. But it also shows that the processing logic of FileSystemRMStateStore is a little fragile. So I suggest a small change here.

        lachisis lachisis added a comment -

        Thanks for the chance to provide the patch.
        I will submit the patch later.

        brahmareddy Brahma Reddy Battula added a comment -

        "when standby resourcemanager try to transitiontoActive, it will cost more than ten minutes to load applications"

        Did you dig into this one, i.e. why it took 10 minutes? Thanks

        nijel nijel added a comment -

        Hi @lachisis

        "when standby resourcemanager try to transitiontoActive, it will cost more than ten minutes to load applications"

        Is this a secure cluster?

        lachisis lachisis added a comment -

        Yes it is. But I need to configure "yarn.resourcemanager.state-store.max-completed-applications" to limit the number of applications in the rmstore.
        Before I modified the configuration, it took ten minutes to switch to active with four thousand apps in the rmstore. That situation is not comfortable.

        nijel nijel added a comment -

        One possible cause is discussed in YARN-868.
        Can you try the solution given in that issue?

        lachisis lachisis added a comment -

        Check that the file exists before deleting it.


          People

          • Assignee: Unassigned
          • Reporter: lachisis
          • Votes: 0
          • Watchers: 5
