Details
Type: Task
Status: Closed
Priority: Minor
Resolution: Not A Problem
Description
Say you have the following task config. Note that all processes have `max_failures` set to 1.
{
  "processes": [
    {
      "daemon": false,
      "name": "hello-0",
      "max_failures": 1,
      "ephemeral": false,
      "min_duration": 5,
      "cmdline": "while true; do echo `date`; sleep 60; done",
      "final": false
    },
    {
      "daemon": false,
      "name": "hello-1",
      "max_failures": 1,
      "ephemeral": false,
      "min_duration": 5,
      "cmdline": "while true; do echo `date`; sleep 60; done",
      "final": false
    },
    {
      "daemon": false,
      "name": "hello-2",
      "max_failures": 1,
      "ephemeral": false,
      "min_duration": 5,
      "cmdline": "while true; do echo `date`; sleep 60; done",
      "final": false
    }
  ],
  "name": "hello-0",
  "finalization_wait": 30,
  "max_failures": 1,
  "max_concurrency": 0,
  "resources": {
    "gpu": 0,
    "disk": 16777216,
    "ram": 1048576,
    "cpu": 0.1
  },
  "constraints": []
}
Say we kill one of these thermos processes. The process gets restarted, since it technically did not crash or fail. Even if you kill it with `kill -SIGSEGV <pid>`, it comes back up and the number of failures stays at 0. The kill is instead registered as the process being lost, and that count increases correctly.
I think it makes sense to check the exit code on a process kill and count it as a failure if the exit code is not `0`.
Note that if one of the processes fails or crashes on its own, it is handled differently:
- on_killed
D0706 18:38:32.944282 12808 runner.py:156] Process on_killed ProcessStatus(seq=3, process='hello-2', start_time=None, coordinator_pid=None, pid=None, return_code=-9, state=4, stop_time=1499366312.421471, fork_time=None)
- on_failed
D0706 22:37:14.829272 23216 runner.py:138] Process on_failed ProcessStatus(seq=3, process='hello-bad', start_time=None, coordinator_pid=None, pid=None, return_code=139, state=5, stop_time=1499380634.768661, fork_time=None)
We can just check the `ProcessStatus.return_code` and act accordingly.
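A minimal sketch of that check, not the actual runner code: the `ProcessStatus` tuple below is a hypothetical stand-in that carries only two of the fields seen in the log lines, and `counts_as_failure` is an illustrative helper name, not an existing Thermos function. The return codes `-9` and `139` are taken from the logs above; the `0` case is an assumed clean exit added for contrast.

```python
from collections import namedtuple

# Hypothetical stand-in for the ProcessStatus records shown in the logs;
# only the fields needed for this illustration are included.
ProcessStatus = namedtuple("ProcessStatus", ["process", "return_code"])


def counts_as_failure(status):
    """Return True when the process has a known, nonzero return code."""
    return status.return_code is not None and status.return_code != 0


# Values taken from the on_killed and on_failed log lines.
killed = ProcessStatus(process="hello-2", return_code=-9)
failed = ProcessStatus(process="hello-bad", return_code=139)
# Assumed example of a clean exit, for comparison.
clean = ProcessStatus(process="hello-0", return_code=0)

for status in (killed, failed, clean):
    print(status.process, status.return_code, counts_as_failure(status))
```

Under this rule both the killed process and the crashed process would count toward `max_failures`, while a clean exit would not.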