Description
As a user of Aurora, I would like my processes to be terminated gracefully so that they have time to flush their buffers and clean up resources such as database connections.
In its current form, the executor sends a TERM signal that is followed almost immediately by a KILL signal. As an example, see the timings in the following debug log output of a Thermos runner, where SIGTERM is sent at 13:20:56.832633 and SIGKILL at 13:20:56.837582, roughly 5 ms later:
D0526 13:20:56.829274 29 ckpt.py:348] Flipping task state from ACTIVE to CLEANING
D0526 13:20:56.829396 29 runner.py:242] _on_task_transition: TaskStatus(state=5, runner_uid=0, runner_pid=29, timestamp_ms=1432639256829)
D0526 13:20:56.829545 29 runner.py:188] Task on_cleaning(TaskStatus(state=5, runner_uid=0, runner_pid=29, timestamp_ms=1432639256829)) TaskRunnerHelper.terminate_process(service)
D0526 13:20:56.832633 29 helper.py:238] => SIGTERM pid 119
D0526 13:20:56.832775 29 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining: 59.9783368111
D0526 13:20:56.834014 118 process.py:103] [process: 118=service]: child state transition [/var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.service] <= RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3, process=u'service', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1432639256.833447, fork_time=None), runner_header=None)
D0526 13:20:56.834566 118 process.py:103] [process: 118=service]: Coordinator exiting.
D0526 13:20:56.835757 29 runner.py:873] Run loop: Work to be done within 1.0s
D0526 13:20:56.836005 29 recordio.py:137] /var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.service has no data (current offset = 177)
D0526 13:20:56.836102 29 muxer.py:155] select() returning 1 updates:
D0526 13:20:56.836200 29 muxer.py:157] = RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3, process='service', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1432639256.833447, fork_time=None), runner_header=None)
D0526 13:20:56.836282 29 ckpt.py:379] Running state machine for process=service/seq=3
D0526 13:20:56.836913 29 runner.py:238] _on_process_transition: ProcessStatus(seq=3, process='service', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1432639256.833447, fork_time=None)
D0526 13:20:56.837102 29 runner.py:156] Process on_killed ProcessStatus(seq=3, process='service', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1432639256.833447, fork_time=None)
D0526 13:20:56.837189 29 helper.py:244] TaskRunnerHelper.kill_process(service)
D0526 13:20:56.837582 29 helper.py:252] => SIGKILL coordinator group 118
D0526 13:20:56.837745 29 helper.py:255] => SIGKILL coordinator 118
D0526 13:20:56.838052 29 muxer.py:94] unregistering service
D0526 13:20:56.838052 29 runner.py:160] Process killed, marking it as a loss.
D0526 13:20:56.838052 29 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining: 59.9730448723
D0526 13:20:56.844118 29 runner.py:873] Run loop: Work to be done within 1.0s
D0526 13:20:56.894645 64 process.py:103] [process: 64=reverse_proxy]: child state transition [/var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.reverse_proxy] <= RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3, process=u'reverse_proxy', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1432639256.893275, fork_time=None), runner_header=None)
D0526 13:20:56.894645 64 process.py:103] [process: 64=reverse_proxy]: Coordinator exiting.
D0526 13:20:57.849862 29 helper.py:376] Detected terminated process: pid=118, status=9, rusage=resource.struct_rusage(ru_utime=0.008, ru_stime=0.024, ru_maxrss=19080, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=2448, ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=20, ru_nivcsw=14)
D0526 13:20:57.850090 29 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining: 58.9610338211
D0526 13:20:57.852466 29 runner.py:870] Run loop: No more work to be done in state CLEANING
D0526 13:20:57.852730 29 ckpt.py:348] Flipping task state from CLEANING to FINALIZING
The expected behavior is that Thermos only resorts to killing a process when the application does not honor the termination request.
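To illustrate the requested escalation, here is a minimal sketch in plain Python. It is not Thermos code: `terminate_gracefully` and `grace_period_secs` are hypothetical names chosen for this example, and it assumes a POSIX system and a child started through `subprocess.Popen`. The idea is simply to send SIGTERM, wait for a grace period, and send SIGKILL only if the process is still alive afterwards.

```python
import signal
import subprocess
import sys


def terminate_gracefully(proc, grace_period_secs=10.0):
    """Ask a child to exit with SIGTERM; use SIGKILL only after a grace period.

    Hypothetical helper for illustration, not part of Thermos.
    Returns the name of the last signal that had to be sent.
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        # Give the application time to flush buffers and release resources.
        proc.wait(timeout=grace_period_secs)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        # The application did not honor the request within the grace period.
        proc.kill()  # sends SIGKILL on POSIX
        proc.wait()  # reap the child so it does not linger as a zombie
        return "SIGKILL"


if __name__ == "__main__":
    # A child with the default SIGTERM disposition exits on the first signal.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    last_signal = terminate_gracefully(child, grace_period_secs=5.0)
    print(last_signal, child.returncode == -signal.SIGTERM)
```

With this ordering, a well-behaved process is never sent SIGKILL, while a process that ignores SIGTERM is still stopped within a bounded time.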
Using the HTTP signals `/quitquitquit` and `/abortabortabort` is not an option because these requests are unauthenticated, which is an inherent security problem.