  1. Aurora
  2. AURORA-1941

Cause container restart when a process is killed with a signal.

    Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Say you have the following task config. Note that all processes have max_failures = 1.

      {
          "processes": [
              {
                  "daemon": false, 
                  "name": "hello-0", 
                  "max_failures": 1, 
                  "ephemeral": false, 
                  "min_duration": 5, 
                  "cmdline": "while true; do echo `date`; sleep 60; done", 
                  "final": false
              }, 
              {
                  "daemon": false, 
                  "name": "hello-1", 
                  "max_failures": 1, 
                  "ephemeral": false, 
                  "min_duration": 5, 
                  "cmdline": "while true; do echo `date`; sleep 60; done", 
                  "final": false
              }, 
              {
                  "daemon": false, 
                  "name": "hello-2", 
                  "max_failures": 1, 
                  "ephemeral": false, 
                  "min_duration": 5, 
                  "cmdline": "while true; do echo `date`; sleep 60; done", 
                  "final": false
              }
          ], 
          "name": "hello-0", 
          "finalization_wait": 30, 
          "max_failures": 1, 
          "max_concurrency": 0, 
          "resources": {
              "gpu": 0, 
              "disk": 16777216, 
              "ram": 1048576, 
              "cpu": 0.1
          }, 
          "constraints": []
      }
      

      Say we kill one of these Thermos processes. The process gets restarted, since technically it did not crash or fail. Even if you kill it with `kill -SIGSEGV <pid>`, it comes back up and its failure count stays at 0. The kill is instead registered as the process being lost, and that count increases correctly.
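The effect of a signal on the reported return code can be reproduced outside Thermos. The following is a minimal sketch, assuming a POSIX system and Python's standard `subprocess` module; the helper name `return_code_after_signal` is hypothetical and is not part of Thermos. It starts a sleeping child, sends it a signal, and returns the code the parent observes. For a child terminated by signal N, `subprocess` reports `-N`, which matches the `return_code=-9` seen in the `on_killed` log line in this ticket.

```python
import signal
import subprocess
import sys


def return_code_after_signal(sig: int) -> int:
    """Start a sleeping child, send it `sig`, and return its return code.

    Hypothetical helper for illustration only; it does not use Thermos.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(sig)
    # On POSIX, a child terminated by signal N is reported as -N.
    return child.wait(timeout=10)


if __name__ == "__main__":
    for sig in (signal.SIGKILL, signal.SIGSEGV):
        print(signal.Signals(sig).name, return_code_after_signal(sig))
```

In both cases the return code is negative rather than `0`, so a check on the return code alone can tell a signal kill apart from a clean exit.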

      I think it makes sense to check the exit code when a process is killed and count it as a failure if the exit code is not `0`.

      Note that if one of the processes fails or crashes, it is handled differently:

      • on_killed
        D0706 18:38:32.944282 12808 runner.py:156] Process on_killed ProcessStatus(seq=3, process='hello-2', start_time=None, coordinator_pid=None, pid=None, return_code=-9, state=4, stop_time=1499366312.421471, fork_time=None)
        
      • on_failed
        D0706 22:37:14.829272 23216 runner.py:138] Process on_failed ProcessStatus(seq=3, process='hello-bad', start_time=None, coordinator_pid=None, pid=None, return_code=139, state=5, stop_time=1499380634.768661, fork_time=None)
        

      We can just check the `ProcessStatus.return_code` and act accordingly.
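As a rough illustration of what "act accordingly" could look like, here is a minimal sketch of classifying a return code. It is not Thermos code: the function name `describe_return_code` is hypothetical, and the sketch assumes two conventions, namely that a negative value `-N` means "terminated by signal N" (as Python's `subprocess` reports it) and that a value above 128 is a shell-style `128 + N` (which is presumably how the `return_code=139` above arose from SIGSEGV, signal 11 on Linux). The second rule is a heuristic, since a program may also exit with such a code on its own.

```python
import signal
from typing import Optional


def describe_return_code(return_code: Optional[int]) -> str:
    """Classify a return code; hypothetical helper, not part of Thermos."""
    if return_code is None:
        return "unknown"
    if return_code == 0:
        return "success"
    if return_code < 0:
        # subprocess-style: -N means terminated by signal N.
        signum = -return_code
    elif return_code > 128:
        # Shell-style heuristic: 128 + N means terminated by signal N.
        signum = return_code - 128
    else:
        return "failed (exit code %d)" % return_code
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = "signal %d" % signum
    return "killed by " + name


if __name__ == "__main__":
    for code in (0, 1, -9, 139):
        print(code, describe_return_code(code))
```

Under the proposal in this ticket, every result other than "success" would count toward `max_failures`.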

        Activity

        rezam Reza Motamedi added a comment -

        I did not think about this in advance. I guess there is no way to kill a process and have it exit with exit code 0.

        rezam Reza Motamedi added a comment -

        This does not seem to be a problem. It has been like this by design.


          People

          • Assignee:
            Unassigned
            Reporter:
            rezam Reza Motamedi