[MESOS-8391] Mesos agent doesn't notice that a pod task exits or crashes after the agent restart - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.5.0
Fix Version/s: 1.5.0
Component/s: agent, containerization, executor
Labels: None
Target Version/s: 1.5.0
Story Points: 3

Description

(1) Agent doesn't detect that a pod task exits/crashes

Create a Marathon pod with two containers, each of which just runs `sleep 10000`.
Restart the Mesos agent on the node where the pod was launched.
Kill one of the pod tasks.

Expected result: The Mesos agent detects that one of the tasks was killed and forwards a `TASK_FAILED` status update to Marathon.

Actual result: The Mesos agent does nothing, and the Mesos master thinks that both tasks are running just fine. Marathon doesn't take any action because it doesn't receive any update from Mesos.
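The pod from the first step of scenario (1) could be defined roughly as in the sketch below. The field names follow the Marathon pod JSON schema as best recalled and should be checked against the Marathon version in use; the pod id, container names, and resource values are hypothetical and do not come from this report.

```python
import json


def make_sleep_pod(pod_id="/sleep-pod", names=("sleep1", "sleep2")):
    """Build a Marathon pod definition with one `sleep 10000` container per name.

    pod_id, names, and the resource sizes are illustrative assumptions.
    """
    containers = [
        {
            "name": name,
            # Each container only sleeps, as described in the reproduction steps.
            "exec": {"command": {"shell": "sleep 10000"}},
            "resources": {"cpus": 0.1, "mem": 32},
        }
        for name in names
    ]
    return {
        "id": pod_id,
        "containers": containers,
        "networks": [{"mode": "host"}],
        "scaling": {"kind": "fixed", "instances": 1},
    }


if __name__ == "__main__":
    # Print the JSON body that would be submitted to Marathon's pods API.
    print(json.dumps(make_sleep_pod(), indent=2))
```

The printed JSON is what would be posted to Marathon to create the pod; the agent restart and the task kill are then done by hand on the node.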

(2) After the agent restart, it detects that the task crashed, forwards the correct status update, but the other task stays in `TASK_KILLING` state forever

Perform the steps in (1).
Restart the Mesos agent.

Expected result: The Mesos agent detects that one of the tasks crashed, forwards the corresponding status update, and kills the other task too.

Actual result: The Mesos agent detects that one of the tasks crashed and forwards the corresponding status update, but the other task stays in `TASK_KILLING` state forever.

Note that after another agent restart, the other task finally gets killed and the correct status updates are propagated all the way to Marathon.
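One way to confirm the stuck state in scenario (2) is to inspect the task list reported by the Mesos master. The helper below is a minimal sketch that assumes the response has already been fetched and parsed, and that it has the shape `{"tasks": [{"id": ..., "state": ...}]}`, as the master's `/tasks` endpoint is understood to return. The sample payload and task ids are made up for illustration and are not taken from the attached logs.

```python
def tasks_in_state(tasks_json, wanted="TASK_KILLING"):
    """Return the ids of all tasks whose reported state equals `wanted`.

    tasks_json is assumed to be the parsed JSON of the master's task list.
    """
    return [
        task["id"]
        for task in tasks_json.get("tasks", [])
        if task.get("state") == wanted
    ]


# Hypothetical payload: one pod task failed, its sibling never left TASK_KILLING.
sample = {
    "tasks": [
        {"id": "sleep-pod.sleep1", "state": "TASK_FAILED"},
        {"id": "sleep-pod.sleep2", "state": "TASK_KILLING"},
    ]
}

stuck = tasks_in_state(sample)
print(stuck)  # -> ['sleep-pod.sleep2']
```

Running the same check repeatedly and seeing the same id each time would match the "stays in `TASK_KILLING` state forever" symptom.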

Attachments

testing-log-2.tar.gz, 05/Jan/18 23:36, 514 kB, Gilbert Song

Issue Links

is broken by: MESOS-7506 Multiple tests leave orphan containers. (Resolved)

is related to: MESOS-8423 Improving debug logging in Mesos Containerizer. (Reviewable)

People

Assignee: Andrei Budnik

Reporter: Ivan Chernetsky

Votes: 0

Watchers: 5

Dates

Created: 04/Jan/18 02:27

Updated: 26/Mar/18 11:57

Resolved: 13/Jan/18 01:25
