[AURORA-1335] Thermos should not immediately resort to killing processes - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: None
Fix Version/s: None
Component/s: Executor, Thermos
Labels: None

Description

As a user of Aurora, I would like my processes to be terminated gracefully so that they have time to flush their buffers and clean up resources such as database connections.

In its current form, the executor sends a TERM signal which is immediately followed by a KILL signal. As an example, see the timings in the following debug log output of a thermos runner:

D0526 13:20:56.829274 29 ckpt.py:348] Flipping task state from ACTIVE to CLEANING
D0526 13:20:56.829396 29 runner.py:242] _on_task_transition: TaskStatus(state=5, runner_uid=0, runner_pid=29, timestamp_ms=1432639256829)
D0526 13:20:56.829545 29 runner.py:188] Task on_cleaning(TaskStatus(state=5, runner_uid=0, runner_pid=29, timestamp_ms=1432639256829))
TaskRunnerHelper.terminate_process(service)
D0526 13:20:56.832633 29 helper.py:238]    => SIGTERM pid 119
D0526 13:20:56.832775 29 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining: 59.9783368111
D0526 13:20:56.834014 118 process.py:103] [process:  118=service]: child state transition [/var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.service] <= RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3, process=u'service', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1432639256.833447, fork_time=None), runner_header=None)
D0526 13:20:56.834566 118 process.py:103] [process:  118=service]: Coordinator exiting.
D0526 13:20:56.835757 29 runner.py:873] Run loop: Work to be done within 1.0s
D0526 13:20:56.836005 29 recordio.py:137] /var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.service has no data (current offset = 177)
D0526 13:20:56.836102 29 muxer.py:155] select() returning 1 updates:
D0526 13:20:56.836200 29 muxer.py:157]   = RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3, process='service', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1432639256.833447, fork_time=None), runner_header=None)
D0526 13:20:56.836282 29 ckpt.py:379] Running state machine for process=service/seq=3
D0526 13:20:56.836913 29 runner.py:238] _on_process_transition: ProcessStatus(seq=3, process='service', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1432639256.833447, fork_time=None)
D0526 13:20:56.837102 29 runner.py:156] Process on_killed ProcessStatus(seq=3, process='service', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1432639256.833447, fork_time=None)
D0526 13:20:56.837189 29 helper.py:244] TaskRunnerHelper.kill_process(service)
D0526 13:20:56.837582 29 helper.py:252]    => SIGKILL coordinator group 118
D0526 13:20:56.837745 29 helper.py:255]    => SIGKILL coordinator 118
D0526 13:20:56.838052 29 muxer.py:94] unregistering service
D0526 13:20:56.838052 29 runner.py:160] Process killed, marking it as a loss.
D0526 13:20:56.838052 29 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining: 59.9730448723
D0526 13:20:56.844118 29 runner.py:873] Run loop: Work to be done within 1.0s
D0526 13:20:56.894645 64 process.py:103] [process:   64=reverse_proxy]: child state transition [/var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.reverse_proxy] <= RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3, process=u'reverse_proxy', start_time=None, coordinator_pid=None, pid=None, return_code=-15, state=4, stop_time=1432639256.893275, fork_time=None), runner_header=None)
D0526 13:20:56.894645 64 process.py:103] [process:   64=reverse_proxy]: Coordinator exiting.
D0526 13:20:57.849862 29 helper.py:376] Detected terminated process: pid=118, status=9, rusage=resource.struct_rusage(ru_utime=0.008, ru_stime=0.024, ru_maxrss=19080, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=2448, ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=20, ru_nivcsw=14)
D0526 13:20:57.850090 29 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining: 58.9610338211
D0526 13:20:57.852466 29 runner.py:870] Run loop: No more work to be done in state CLEANING
D0526 13:20:57.852730 29 ckpt.py:348] Flipping task state from CLEANING to FINALIZING

The expected behavior is that Thermos only resorts to killing a process when the application does not honor the termination request.
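To make the expected behavior concrete, here is a minimal sketch of a TERM-then-wait-then-KILL sequence. It is not Thermos code: the function name `terminate_gracefully` and the `grace_secs` parameter are hypothetical, and it assumes a POSIX system and a child managed through Python's `subprocess` module.

```python
import signal
import subprocess
import sys


def terminate_gracefully(proc, grace_secs=10.0):
    """Send SIGTERM, wait up to grace_secs, and only then send SIGKILL.

    Hypothetical helper for illustration; returns the name of the last
    signal that was sent to the child.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        # Give the application time to flush buffers and close connections.
        proc.wait(timeout=grace_secs)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        # The application did not honor the termination request in time.
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
        return "SIGKILL"


if __name__ == "__main__":
    # Demo child that ignores SIGTERM, so the helper has to escalate.
    child_code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(30)\n"
    )
    child = subprocess.Popen([sys.executable, "-c", child_code],
                             stdout=subprocess.PIPE)
    child.stdout.readline()  # wait until the handler is installed
    print(terminate_gracefully(child, grace_secs=1.0))
```

The point of the sketch is the ordering: SIGKILL is sent only after the grace period has elapsed without the child exiting, rather than a few milliseconds after SIGTERM as in the log above.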

Using the HTTP signals `/quitquitquit` and `/abortabortabort` is not an option, because these endpoints accept unauthenticated requests and are therefore inherently insecure.
