Hadoop YARN / YARN-3813

Support Application timeout feature in YARN.

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0
    • Component/s: scheduler
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      It would be useful to support an application timeout in YARN. In some use cases, the output of an application is of no use if the application does not complete within a specific time.

      Background:
      The requirement is to show the CDR statistics for the last few minutes, say every 5 minutes. The same job runs continuously with a different dataset each time,
      so one job is started every 5 minutes. The estimated time for this task is 2 minutes or less.
      If the application does not complete in the given time, the output is not useful.

      Proposal
      The idea is to support an application timeout, where a timeout parameter is given when the job is submitted.
      The user expects the application to finish (complete or be killed) within the given time.

      One option is to move this logic to the application client (which submits the job),
      but it would be better to make it generic logic, which would also be more robust.

      Kindly provide your suggestions/opinions on this feature. If it sounds good, I will update the design doc and a prototype patch.

      1. YARN Application Timeout .pdf
        100 kB
        nijel
      2. 0001-YARN-3813.patch
        31 kB
        nijel
      3. 0002_YARN-3813.patch
        32 kB
        nijel
      4. Yarn Application Timeout.v1.pdf
        301 kB
        Rohith Sharma K S

          Activity

          andrew.wang Andrew Wang added a comment -

          Hi, could someone add a release note for this new feature?

          rohithsharma Rohith Sharma K S added a comment -

          Closing this issue as all sub-items are done.

          rohithsharma Rohith Sharma K S added a comment -

          Assigning to myself, I will be working on this feature JIRA

          leftnoteasy Wangda Tan added a comment -

          nijel, I think you created a sub JIRA for this task, so I'm cancelling patch from this JIRA.

          nijel nijel added a comment -

          Thanks Wangda Tan for the comments.
          I will update the patch to address the code comments.

          > Do you plan to support updating lifetime when the application is running?

          As per our understanding, there are two use cases for this:
          1. The user can increase the lifetime of an application that the timeout monitor is already tracking, after seeing its progress.
          2. The user can add a timeout to a running application so that it is monitored as well.

          In both cases, the updated timeout is the total lifetime, measured from the submission time.
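
A minimal sketch of that rule, assuming times are plain epoch milliseconds. The class and method names are hypothetical and are not taken from the patch; the point is only that an updated lifetime is counted from the submission time, not from the moment of the update.

```java
/** Hypothetical helper illustrating "lifetime counts from submission". */
class LifetimeMath {
  /** Absolute deadline of an application, in the same clock as submitTimeMs. */
  static long deadlineMs(long submitTimeMs, long lifetimeMs) {
    return submitTimeMs + lifetimeMs;
  }

  /** Time left before the deadline, never negative. */
  static long remainingMs(long submitTimeMs, long lifetimeMs, long nowMs) {
    return Math.max(0L, deadlineMs(submitTimeMs, lifetimeMs) - nowMs);
  }
}
```

For example, if an application submitted at t = 0 has its lifetime raised to 300 s when it is 90 s old, it has 210 s left under this rule, not 300 s.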

          Please let us know if you are thinking of any other scenario, so that we can plan the interfaces accordingly.

          > Do you plan to get lifetime via ApplicationReport, CLI, REST API?

          Yes. As of now we plan for ApplicationReport. Based on the dynamic update, the interfaces can be defined and handled as a subtask.

          We also had an offline chat, and a few subtasks have been raised for the planned work. Please give your opinion.

          leftnoteasy Wangda Tan added a comment -

          Thanks nijel and Rohith Sharma K S for working on this, it is a very useful feature to me.

          Comments for existing patch:

          • Instead of "Timeout", would it be better to rename it to "lifetime"? "Timeout" will be confused with the application heartbeat timeout.
          • The synchronized lock in RMAppTimeOutMonitor is unnecessary; you can use putIfAbsent instead, and remove does not need to check for existence first.
          • All LOG.debug calls should be guarded by LOG.isDebugEnabled.
          • threadSleepTime -> something like monitoringInterval?
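
The two concurrency and logging points above can be sketched as follows. This is an illustration only, not the patch's code: the class name is hypothetical, application ids are plain strings, and java.util.logging stands in for the logging API used in Hadoop, with isLoggable(Level.FINE) playing the role of LOG.isDebugEnabled().

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-in for the monitor's bookkeeping. */
class LifetimeRegistry {
  private static final Logger LOG =
      Logger.getLogger(LifetimeRegistry.class.getName());

  // Application id -> lifetime in milliseconds.
  private final ConcurrentMap<String, Long> lifetimes = new ConcurrentHashMap<>();

  /** Registers an application once; returns false if it was already present. */
  boolean register(String appId, long lifetimeMs) {
    // putIfAbsent is atomic on a ConcurrentMap, so no synchronized block is needed.
    boolean added = lifetimes.putIfAbsent(appId, lifetimeMs) == null;
    // Guard the message so the string is only built when debug output is on.
    if (added && LOG.isLoggable(Level.FINE)) {
      LOG.fine("Registered " + appId + " with lifetime " + lifetimeMs + " ms");
    }
    return added;
  }

  /** Removes an application; remove() is a no-op for keys that are absent. */
  void unregister(String appId) {
    lifetimes.remove(appId);
  }

  /** Number of applications currently tracked. */
  int size() {
    return lifetimes.size();
  }
}
```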

          Some questions about some related tasks/plans:

          • Do you plan to support updating lifetime when the application is running?
          • Do you plan to get lifetime via ApplicationReport, CLI, REST API?

          Thanks,

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote | Subsystem       | Runtime  | Comment
          0    | pre-patch       | 19m 45s  | Pre-patch trunk compilation is healthy.
          +1   | @author         | 0m 0s    | The patch does not contain any @author tags.
          +1   | tests included  | 0m 0s    | The patch appears to include 2 new or modified test files.
          +1   | javac           | 8m 3s    | There were no new javac warning messages.
          +1   | javadoc         | 10m 7s   | There were no new javadoc warning messages.
          +1   | release audit   | 0m 23s   | The applied patch does not increase the total number of release audit warnings.
          -1   | checkstyle      | 1m 56s   | The applied patch generated 2 new checkstyle issues (total was 238, now 239).
          -1   | whitespace      | 0m 2s    | The patch has 6 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1   | install         | 1m 32s   | mvn install still works.
          +1   | eclipse:eclipse | 0m 34s   | The patch built with eclipse:eclipse.
          -1   | findbugs        | 4m 45s   | The patch appears to introduce 1 new Findbugs (version 3.0.0) warnings.
          -1   | yarn tests      | 0m 21s   | Tests failed in hadoop-yarn-api.
          +1   | yarn tests      | 2m 2s    | Tests passed in hadoop-yarn-common.
          -1   | yarn tests      | 190m 59s | Tests failed in hadoop-yarn-server-resourcemanager.
               |                 | 241m 1s  |



          Reason Tests
          FindBugs module:hadoop-yarn-server-resourcemanager
          Failed unit tests hadoop.yarn.conf.TestYarnConfigurationFields
            hadoop.yarn.server.resourcemanager.rmcontainer.TestRMContainerImpl
            hadoop.yarn.server.resourcemanager.TestApplicationCleanup
            hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions
            hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched
            hadoop.yarn.server.resourcemanager.TestRMHA
            hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerQueueACLs
            hadoop.yarn.server.resourcemanager.TestResourceManager
            hadoop.yarn.server.resourcemanager.TestApplicationMasterService
            hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA
            hadoop.yarn.server.resourcemanager.scheduler.TestSchedulerHealth
            hadoop.yarn.server.resourcemanager.scheduler.TestSchedulerUtils
            hadoop.yarn.server.resourcemanager.scheduler.capacity.TestWorkPreservingRMRestartForNodeLabel
            hadoop.yarn.server.resourcemanager.TestAMAuthorization
            hadoop.yarn.server.resourcemanager.TestApplicationMasterLauncher
            hadoop.yarn.server.resourcemanager.TestResourceTrackerService
            hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation
            hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs
            hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
            hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerDynamicBehavior
            hadoop.yarn.server.resourcemanager.TestApplicationACLs
            hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler
            hadoop.yarn.server.resourcemanager.TestContainerResourceUsage
            hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler
            hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
            hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerNodeLabelUpdate
            hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
            hadoop.yarn.server.resourcemanager.rmapp.TestApplicationTimeOut
            hadoop.yarn.server.resourcemanager.scheduler.fair.TestAppRunnability
            hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority
            hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter
            hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus
            hadoop.yarn.server.resourcemanager.rmapp.TestNodesListManager
          Timed out tests org.apache.hadoop.yarn.server.resourcemanager.TestRM
            org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
            org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
            org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
            org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12761922/0002_YARN-3813.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / c890c51
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/9244/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/9244/artifact/patchprocess/whitespace.txt
          Findbugs warnings https://builds.apache.org/job/PreCommit-YARN-Build/9244/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          hadoop-yarn-api test log https://builds.apache.org/job/PreCommit-YARN-Build/9244/artifact/patchprocess/testrun_hadoop-yarn-api.txt
          hadoop-yarn-common test log https://builds.apache.org/job/PreCommit-YARN-Build/9244/artifact/patchprocess/testrun_hadoop-yarn-common.txt
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/9244/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9244/testReport/
          Java 1.7.0_55
          uname Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9244/console

          This message was automatically generated.

          nijel nijel added a comment -

          Thanks Rohith Sharma K S and Sunil G for the comments.
          I have updated the patch with fixes for the comments and a test case for recovery.

          > We always start the monitor thread, regardless of whether any application asks for an application timeout. I feel we could have a configuration to enable this feature at the RM level. Thoughts?

          As I mentioned to you offline, this service considers only apps that are configured with a timeout, so I am leaving it as a default service.

          > RMAppTimeOutMonitor : When InterruptedException is thrown in the code below, the thread should break out of the loop or rethrow the exception so that it dies; otherwise the thread will stay alive forever.

          The while loop is guarded by a check of the thread's interrupted state.

          sunilg Sunil G added a comment -

          Hi nijel
          Thank you for the updated patch. A couple of points.

          1. We always start the monitor thread, regardless of whether any application asks for an application timeout. I feel we could have a configuration to enable this feature at the RM level. Thoughts?
          2. I feel we can store the ApplicationId in this monitor instead of the RMApp object itself.

          private ConcurrentMap<RMApp, Long> rmApps =
                new ConcurrentHashMap<RMApp, Long>();
          

          And we can use rmContext.getRMApps().get(appId) while iterating in the loop inside RMAppTimeOutMonitorThread.

          Or is there any advantage in using RMApp that I am missing? If so, then it's fine. Thoughts?

          rohithsharma Rohith Sharma K S added a comment -

          Thanks nijel for the patch. Some comments:

          1. RMAppTimeOutMonitor should be part of RMActiveServiceContext. In RMContext, move RMAppTimeOutMonitor to RMActiveServiceContext.
          2. The field protected RMAppTimeOutMonitor rmAppTimeOutMonitor; in ResourceManager can be moved to RMActiveServices.
          3. RMAppImpl.java : applicationTimeout variable can be reused.
          4. RMAppTimeOutMonitor : private ConcurrentMap<RMApp, Long> rmApps = new ConcurrentHashMap<RMApp, Long>(); I think ApplicationId can be used as the key. To trigger the KILL event, the RMApp can be fetched from rmContext.
          5. RMAppTimeOutMonitor : When InterruptedException is thrown in the code below, the thread should break out of the loop or rethrow the exception so that it dies; otherwise the thread will stay alive forever.
            +        try {
            +          Thread.sleep(threadSleepTime);
            +        } catch (InterruptedException e1) {
            +          LOG.debug("RMAppTimeOut sleep is over. Going for next iteration.");
            +        }
            
          6. Tests : Need to add some more concrete tests covering up scenarios.
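
One way to address point 5 is sketched below. It assumes a plain Runnable with hypothetical names rather than the patch's actual class. Thread.sleep() clears the thread's interrupted status when it throws InterruptedException, so a loop guarded only by isInterrupted() keeps running unless the catch block restores the flag or leaves the loop.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical monitor loop; names are illustrative, not from the patch. */
class InterruptibleMonitor implements Runnable {
  private final long monitoringIntervalMs;

  /** Counts completed passes; stands in for the real timeout check. */
  final AtomicInteger sweeps = new AtomicInteger();

  InterruptibleMonitor(long monitoringIntervalMs) {
    this.monitoringIntervalMs = monitoringIntervalMs;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      sweeps.incrementAndGet(); // placeholder for checking registered apps
      try {
        Thread.sleep(monitoringIntervalMs);
      } catch (InterruptedException e) {
        // sleep() cleared the interrupt flag when it threw: restore it and
        // leave the loop so the thread terminates instead of looping forever.
        Thread.currentThread().interrupt();
        break;
      }
    }
  }
}
```

With this shape, interrupting the monitor thread while it sleeps ends the thread; with the catch block from the patch excerpt, the loop would simply start the next iteration.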
          Naganarasimha Naganarasimha G R added a comment -

          True, your point is correct. So basically it should fail if it's not yet in the running state; even a boolean parameter for this should be sufficient, I think!

          sunilg Sunil G added a comment -

          As I see this, the trigger point for identifying an application as timed out (or to be killed) is the elapsed time, calculated from its submission time. If an application can be registered with RMAppTimeOutMonitor after submission, I feel we may not need to worry about its internal state.

          nijel nijel added a comment -

          This patch will address the initial issue, but it will kill the application even if it is in the RUNNING state.

          As I understand it, the idea is to configure the states in which the monitor should consider killing the application. Correct?

          But one doubt I have is whether the user will be aware of all the intermediate states of an app.

          nijel nijel added a comment -

          Sorry for the long delay.

          Adding an initial patch.
          The action taken on timeout is KILL.
          Please have a look. I will update the patch with more test cases after the initial review.

          Thanks

          Naganarasimha Naganarasimha G R added a comment -

          Hi All [ nijel, Rohith Sharma K S, Devaraj K, Vinod Kumar Vavilapalli & Sunil G ],
          Would it be good to also support the YARN-2487 scenario here, in which an application that stays only in the Submitted and Accepted states for a particular period can be killed? Basically, along with the application timeout period, we could also accept the application states in which the application should be killed. Thoughts?

          nijel nijel added a comment -

          Thanks Sunil G and Devaraj K for the comments.

          > How frequently are you going to check this condition for each application?

          The plan is to have a configurable interval, defaulting to 30 seconds (yarn.app.timeout.monitor.interval).

          > Could we have a new TIMEOUT event in RMAppImpl for this. In that case, we may not need a flag.

          > I feel having a TIMEOUT state for RMAppImpl would be proper here.

          OK. We will add a TIMEOUT state and handle the changes.
          Because of this, there will be a few changes in the app transitions, the client package and the web UI.

          > I have a suggestion here. We can have a BasicAppMonitoringManager which can keep an entry of <appId, app.getSubmissionTime>.

          > when the application gets submitted to RM then we can register the application with RMAppTimeOutMonitor using the user specified timeout.

          Yes, good suggestion. We will add this as a registration mechanism. But since each application can have its own timeout period, the scope for code reuse looks minimal.

          RMAppTimeOutMonitor 
          	local map (appid, timeout)
          	add/register(appid, timeout)  --> from RMAppImpl
          	Run -> if the app is running/submitted and the time has elapsed, kill it. If already completed, remove from map.
          	No delete/unregister method  --> the application is removed from the map by the run method
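
The outline above can be modelled roughly as follows. This is a self-contained sketch under stated assumptions, not the ResourceManager implementation: the App record, the reduced State enum and the apps lookup map (standing in for something like rmContext.getRMApps()) are all hypothetical, and "kill" is modelled by flipping the state rather than dispatching an event. The current time is passed in so that a single pass can be exercised deterministically.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of the outlined monitor. */
class AppTimeoutSweep {
  enum State { SUBMITTED, ACCEPTED, RUNNING, FINISHED, KILLED }

  /** Minimal stand-in for an application record. */
  static final class App {
    final String id;
    final long submitTimeMs;
    volatile State state;

    App(String id, long submitTimeMs, State state) {
      this.id = id;
      this.submitTimeMs = submitTimeMs;
      this.state = state;
    }

    boolean isActive() {
      return state == State.SUBMITTED || state == State.ACCEPTED
          || state == State.RUNNING;
    }
  }

  // Local map (appId, timeout in ms); only apps with a timeout are registered.
  private final Map<String, Long> timeouts = new ConcurrentHashMap<>();
  // Lookup of application records by id, supplied by the caller.
  private final Map<String, App> apps;

  AppTimeoutSweep(Map<String, App> apps) {
    this.apps = apps;
  }

  /** Called at submission time; there is no public unregister. */
  void register(String appId, long timeoutMs) {
    timeouts.putIfAbsent(appId, timeoutMs);
  }

  /** One pass of the run loop; returns the ids killed in this pass. */
  List<String> sweep(long nowMs) {
    List<String> killed = new ArrayList<>();
    Iterator<Map.Entry<String, Long>> it = timeouts.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> entry = it.next();
      App app = apps.get(entry.getKey());
      if (app == null || !app.isActive()) {
        it.remove(); // already completed (or unknown): just stop tracking it
      } else if (nowMs - app.submitTimeMs >= entry.getValue()) {
        app.state = State.KILLED; // stands in for sending a KILL event
        killed.add(app.id);
        it.remove(); // drop it right away instead of waiting for the next pass
      }
    }
    return killed;
  }

  /** Number of applications still being tracked. */
  int tracked() {
    return timeouts.size();
  }
}
```

A scheduler thread would call sweep(System.currentTimeMillis()) once per monitoring interval.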
          
          devaraj.k Devaraj K added a comment -

          Thanks nijel and Rohith Sharma K S for the design proposal.

          > New auxiliary service : RMAppTimeOutService
          > Responsibility is to track the running application. Simple logic
          >
          > // if the job is still active and the timeout has elapsed, kill it
          > if ((RMAppState == SUBMITTED/ACCEPTED/RUNNING)
          >     && (currentTime - app.getSubmitTime()) >= timeout)

          How frequently are you going to check this condition for each application?

          Can we have a monitor, something like RMAppTimeOutMonitor, that extends AbstractLivelinessMonitor? When the application gets submitted to the RM, we can register it with RMAppTimeOutMonitor using the user-specified timeout. When the timeout is reached, RMAppTimeOutMonitor can trigger an event to take further action.

          > Yes, having a separate TIMEOUT event and TIMEOUT state is a good approach and is the other option. Initially we considered adding a new TIMEOUT state, but that requires very large changes across all the modules.

          I feel having a TIMEOUT state for RMAppImpl would be proper here. When RMAppTimeOutMonitor triggers an event on timeout for an application, RMAppImpl can move the state to TIMEOUT state from any of the non-final states and during the transition it can handle stopping the running attempt and the containers. I don't see here that there will be so many changes required for achieving it.
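
The transition described here can be sketched as a small pure function. This is a simplified model, not RMAppImpl: the enum is a reduced, hypothetical subset of application states, TIMEOUT is the state proposed in this thread, and stopping the attempt and its containers is only indicated by a comment.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical, simplified model of the proposed TIMEOUT transition. */
class TimeoutTransitionSketch {
  /** Reduced state set for illustration only. */
  enum AppState { NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED, TIMEOUT }

  private static final Set<AppState> FINAL_STATES = EnumSet.of(
      AppState.FINISHED, AppState.FAILED, AppState.KILLED, AppState.TIMEOUT);

  /** Applies a timeout event: non-final states move to TIMEOUT, final ones stay. */
  static AppState onTimeoutEvent(AppState current) {
    if (FINAL_STATES.contains(current)) {
      return current; // the application already ended; ignore the event
    }
    // A real transition would also stop the running attempt and its containers.
    return AppState.TIMEOUT;
  }
}
```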

          rohithsharma Rohith Sharma K S added a comment -

          Thanks Sunil G for going through the design doc and feedback.

          BasicAppMonitoringManager which can keep an entry of <appId, app.getSubmissionTime>.

          Basically, by auxiliary service we mean a separate service that starts a new thread to monitor running applications, i.e. very similar to any other service in the RM, such as ZKRMStateStore/ClientRMService.

          Could we have a new TIMEOUT event in RMAppImpl for this? In that case, we may not need a flag.

          Yes, having a separate TIMEOUT event and TIMEOUT state is a good approach and another option. Initially we considered a new TIMEOUT state, which requires very large changes across all the modules. To keep it simple, we can manage with the KILLED state, a proper diagnostic message, and a new flag. The new flag identifies whether the app timed out, which is required for calculating metrics and for handling RM restart.
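
          For comparison, the simpler option of reusing KILLED with a diagnostic message and a flag might look like the sketch below. All names (`KilledWithFlagSketch`, `AppRecord`, `timedOut`, the diagnostic wording) are assumptions for illustration and are not taken from the patch.

```java
import java.util.List;

/** Hypothetical sketch: timeout reported through KILLED plus a flag. */
class KilledWithFlagSketch {
    enum State { RUNNING, KILLED }

    static final class AppRecord {
        State state = State.RUNNING;
        boolean timedOut = false;   // the extra flag from the proposal
        String diagnostics = "";

        /** Ordinary kill: final state KILLED, flag stays false. */
        void killByUser() {
            state = State.KILLED;
            diagnostics = "Application killed by user";
        }

        /** Timeout kill: same final state, but flagged and explained. */
        void killOnTimeout(long timeoutMs) {
            state = State.KILLED;
            timedOut = true;
            diagnostics = "Application killed: exceeded timeout of " + timeoutMs + " ms";
        }
    }

    /** Metrics can separate timed-out apps from other kills using the flag. */
    static long countTimedOut(List<AppRecord> apps) {
        return apps.stream()
            .filter(a -> a.state == State.KILLED && a.timedOut)
            .count();
    }
}
```

          No new state is introduced, so code that switches on the application state is unaffected; the cost is that the flag has to be stored with the application so that it is still known after an RM restart.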

          vinodkv Vinod Kumar Vavilapalli added a comment -

          A few years ago when this came up, I recommended doing this on top of YARN. But I've seen this enough in the wild to yield now.

          It's a useful feature to come out of the box in YARN. Small enough, so I think we should go ahead with the implementation - not a lot of design dimensions.

          sunilg Sunil G added a comment -

          Hi nijel
          Thanks for sharing the draft. I have a couple of questions.

          • Add a new auxiliary service RMAppTimeOutService to track the running applications and invoke the kill action.

            I have a suggestion here. We can have a BasicAppMonitoringManager which keeps an entry of <appId, app.getSubmissionTime>.
            The AppMonitoringManager interface can expose APIs like addAppMonitoringInfo and removeAppMonitoringInfo. In the BasicAppMonitoringManager implementation, a timer task can monitor the entries registered via addAppMonitoringInfo at app submission time. If any app times out, it can raise a TIMEOUT event to RMAppImpl.

          • Add a flag in RMApp to identify the timed out application. This is for metric purpose.

            Could we have a new TIMEOUT event in RMAppImpl for this? In that case, we may not need a flag.
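
          The interface suggested above could be sketched roughly as follows. Only the names AppMonitoringManager, BasicAppMonitoringManager, addAppMonitoringInfo and removeAppMonitoringInfo come from the comment; the method signatures, the single timeout value, the `scan`, `start` and `stop` methods, and the `Consumer` used in place of an event sent to RMAppImpl are assumptions.

```java
import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Interface named in the comment; the signatures are assumptions. */
interface AppMonitoringManager {
    void addAppMonitoringInfo(String appId, long submissionTimeMs);
    void removeAppMonitoringInfo(String appId);
}

class BasicAppMonitoringManager implements AppMonitoringManager {
    // Entries of <appId, submission time in ms>
    private final Map<String, Long> submissionTimes = new ConcurrentHashMap<>();
    private final long timeoutMs;
    private final Consumer<String> timeoutEventSink; // stand-in for a TIMEOUT event
    private final Timer timer = new Timer("app-timeout-scan", true); // daemon thread

    BasicAppMonitoringManager(long timeoutMs, Consumer<String> timeoutEventSink) {
        this.timeoutMs = timeoutMs;
        this.timeoutEventSink = timeoutEventSink;
    }

    @Override
    public void addAppMonitoringInfo(String appId, long submissionTimeMs) {
        submissionTimes.put(appId, submissionTimeMs);
    }

    @Override
    public void removeAppMonitoringInfo(String appId) {
        submissionTimes.remove(appId);
    }

    /** Starts the periodic timer task that scans the registered entries. */
    void start(long periodMs) {
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                scan(System.currentTimeMillis());
            }
        }, periodMs, periodMs);
    }

    void stop() {
        timer.cancel();
    }

    /** One scan: raise a timeout for every app older than timeoutMs, at most once each. */
    void scan(long nowMs) {
        for (Map.Entry<String, Long> e : submissionTimes.entrySet()) {
            if (nowMs - e.getValue() >= timeoutMs
                    && submissionTimes.remove(e.getKey(), e.getValue())) {
                timeoutEventSink.accept(e.getKey());
            }
        }
    }
}
```

          A caller would add an entry at submission time, remove it when the application finishes, and react to the callback, for example by sending the TIMEOUT event discussed in this thread.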

          nijel nijel added a comment -

          Attached an initial draft of the work details.
          Please share your comments and thoughts.


            People

            • Assignee:
              rohithsharma Rohith Sharma K S
              Reporter:
              nijel nijel
            • Votes:
              1
              Watchers:
              30
