Currently, bouncing an agent is not possible. After launching a child process, an agent adds an entry to its Process Inventory and calls waitFor() on the Process handle to detect child termination. When an agent restarts, it loses all of its children and has no means of recovering its inventory.
The proposal is to change this behavior so that agents can be bounced and subsequently recover their child processes. A bounce may be required, for example, to update agent code.
An agent has two options for recovering its child processes, depending on cgroup availability.
If cgroups are enabled, an agent on startup will read all PIDs from the cgroup.procs file. These PIDs reflect the child processes running on a node. The agent will create a skeleton inventory entry for each PID and fill in the details when the OR state is received, using the PID to find the matching process in the OR state. After the inventory is recovered, the timer-based inventory update will fetch PIDs from the cgroup.procs file again and reconcile them with the inventory. To detect child process termination, the agent will compare the PIDs in its inventory against the PIDs from cgroup.procs. If a PID is in the inventory but not present in cgroup.procs, the agent will mark that process as Stopped if its deallocate flag is true, or as Failed if its deallocate flag is false. Any AP process that is no longer running will be marked as Stopped.
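The cgroup-based recovery and reconciliation described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the agent's actual code: the class, the Entry type, the State values (including the RECOVERING placeholder for a skeleton entry) and all method names are hypothetical, and the path of the cgroup's process list file is left to the caller.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CgroupReconcileSketch {
    // Hypothetical states; RECOVERING stands for a skeleton entry awaiting OR state.
    enum State { RECOVERING, RUNNING, STOPPED, FAILED }

    /** Hypothetical inventory entry; a real entry would carry more fields. */
    static final class Entry {
        final long pid;
        State state;
        final boolean deallocate; // deallocate flag from the proposal
        final boolean ap;         // true if this is an AP process

        Entry(long pid, State state, boolean deallocate, boolean ap) {
            this.pid = pid;
            this.state = state;
            this.deallocate = deallocate;
            this.ap = ap;
        }
    }

    /** Parses one PID per line, ignoring blank lines. */
    static Set<Long> parsePids(List<String> lines) {
        Set<Long> pids = new HashSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                pids.add(Long.parseLong(trimmed));
            }
        }
        return pids;
    }

    /** Reads PIDs from the cgroup's process list file (path supplied by the caller). */
    static Set<Long> readPids(Path procsFile) throws IOException {
        return parsePids(Files.readAllLines(procsFile));
    }

    /** On startup: create a skeleton entry for every PID not yet in the inventory. */
    static void recover(Map<Long, Entry> inventory, Set<Long> livePids) {
        for (Long pid : livePids) {
            inventory.computeIfAbsent(pid, p -> new Entry(p, State.RECOVERING, false, false));
        }
    }

    /** On each timer tick: mark entries whose PID has disappeared. */
    static void reconcile(Map<Long, Entry> inventory, Set<Long> livePids) {
        for (Entry e : inventory.values()) {
            boolean alreadyTerminal = e.state == State.STOPPED || e.state == State.FAILED;
            if (alreadyTerminal || livePids.contains(e.pid)) {
                continue;
            }
            // AP processes are always Stopped; others depend on the deallocate flag.
            e.state = (e.ap || e.deallocate) ? State.STOPPED : State.FAILED;
        }
    }
}
```

In this sketch the skeleton entries default their flags to false until the OR state fills them in; a real agent would presumably defer the Stopped/Failed decision for an entry that is still a skeleton.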
If cgroups are not enabled, an agent will recover its inventory from the OR state. While in this mode, the agent will disable its Rogue Process Detector and will not attempt to detect alien processes. The timer-based inventory update will fetch PIDs from the OS (using the ps command) and reconcile them with the inventory. To detect child process termination, the agent will compare the PIDs in its inventory against the PIDs obtained from the OS. If a PID is in the inventory but is not reported by the OS, the agent will mark that process as Stopped if its deallocate flag is true, or as Failed if its deallocate flag is false. Any AP process that is no longer running will be marked as Stopped.
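For the non-cgroup case, fetching PIDs from the OS might look something like the sketch below. It assumes a ps that accepts the POSIX options "-e -o pid=" (all processes, PID column only, no header); the class and method names are hypothetical and not taken from the agent code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PsPidSource {
    /** Parses ps output with one PID per line; non-numeric lines are skipped. */
    static Set<Long> parsePsOutput(List<String> lines) {
        Set<Long> pids = new HashSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            try {
                pids.add(Long.parseLong(trimmed));
            } catch (NumberFormatException ignored) {
                // A header or error line (stderr is merged below); not a PID.
            }
        }
        return pids;
    }

    /** Runs "ps -e -o pid=" and returns the PIDs it reports. */
    static Set<Long> livePids() throws IOException, InterruptedException {
        Process ps = new ProcessBuilder("ps", "-e", "-o", "pid=")
                .redirectErrorStream(true)
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(ps.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        ps.waitFor(); // ps is short-lived, so waiting here is harmless
        return parsePsOutput(lines);
    }

    /** PIDs that are in the inventory but no longer reported by the OS. */
    static Set<Long> missingFromOs(Set<Long> inventoryPids, Set<Long> osPids) {
        Set<Long> missing = new HashSet<>(inventoryPids);
        missing.removeAll(osPids);
        return missing;
    }
}
```

Each PID returned by missingFromOs would then be marked Stopped or Failed according to its deallocate flag. One caveat with a pure PID comparison is that the OS can reuse PIDs, so a real implementation may want to verify more than the number alone.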
- An agent will no longer call waitFor() on the Process object returned from a ProcessBuilder when a child process is launched.
- An agent will continue to drain stdout and stderr of a child process to prevent the child (duccling) from hanging and to receive OS errors that may occur when exec'ing a process (bad command line, etc.). After duccling calls execve(), the child process's stdout and stderr are redirected to /dev/null and the agent expects nothing further from these streams.
- A child process will communicate state changes and initialization status to an agent via a provided port. The open question is how the port is provided to the child. Currently an agent uses -D (or the environment) to communicate its listener port to a child. The port is determined when the agent starts up and can potentially be different after the agent is bounced. So we either use a Registry to store the agent's port for a child to look up, or insist that an agent has a fixed port. If an agent is bounced and that port is not available, what should happen?
- An agent should support a new flag "-Dclean=[true|false]" which, on startup, will force the agent to clean up (terminate) all child processes found in cgroups. The code for doing this is already in place, and it is the default agent procedure on startup. Whether this should remain the default behavior is still an open question. The same flag should also control what happens on agent shutdown: if clean=true, the agent will terminate its children; otherwise the child processes will remain running.
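The launch-related points above (no waitFor(), draining of stdout and stderr, and the proposed clean flag) could be sketched as below. This is a minimal illustration under assumptions: the class and method names are hypothetical, and the default for the clean flag is passed in by the caller because the proposal leaves it undecided.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Consumer;

class ChildLaunchSketch {
    /** Reads the stream to EOF, handing each line to the sink. */
    static void drainFully(InputStream in, Consumer<String> sink) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sink.accept(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Drains the stream on a daemon thread so the child never blocks on a full pipe. */
    static Thread drainAsync(InputStream in, Consumer<String> sink, String threadName) {
        Thread t = new Thread(() -> drainFully(in, sink), threadName);
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Launches the child and drains its streams; deliberately does not call waitFor(). */
    static Process launch(List<String> command, Consumer<String> sink) throws IOException {
        Process child = new ProcessBuilder(command).start();
        drainAsync(child.getInputStream(), sink, "child-stdout");
        drainAsync(child.getErrorStream(), sink, "child-stderr");
        return child; // termination is detected by PID reconciliation instead
    }

    /** Interprets the raw value of the proposed "clean" property. */
    static boolean cleanEnabled(String rawValue, boolean defaultValue) {
        return rawValue == null ? defaultValue : Boolean.parseBoolean(rawValue.trim());
    }
}
```

An agent might call cleanEnabled(System.getProperty("clean"), true) at startup and again at shutdown to decide whether to terminate its children; the true default here only mirrors the current startup behavior and is not a recommendation.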