We have a simple test that launches a pod with two containers: one writes to a file and the other reads it. The test is flaky because one of the containers sometimes fails to start.
Marathon app definition:
During the test, Marathon tries to launch the pod but never receives a TASK_RUNNING for the first container, so after 2 minutes it decides to kill the pod. The kill also fails.
The agent sandbox (attached to this ticket, minus the Docker layers, which are too big to attach) shows that one of the containers wasn't started properly. The last line in the agent log says:
Until then the log looks pretty unremarkable.
Afterwards, Marathon repeatedly tries to kill the container without success. The executor receives the requests but doesn't send anything back:
Relevant IDs for grepping the logs: