[MESOS-8111] Mesos sees task as running, but cannot kill it because the agent is offline - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 1.2.3
Fix Version/s: None
Component/s: master
Labels: None
Environment: DC/OS 1.9.4

Description

After scaling down a cluster, the master still reports a task as running even though the agent (slave) it ran on is long gone.
At the same time, the master reports that it cannot kill the task because the agent is offline:

I1018 16:55:22.000000  6976 master.cpp:4913] Processing KILL call for task 'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
W1018 16:55:22.000000  6976 master.cpp:5000] Cannot kill task spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 (10.0.0.81) is disconnected. Kill will be retried if the agent re-registers

Clearly, if the agent is offline, the task is not running either. I am also not sure that waiting indefinitely for an agent to recover is a good strategy.
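
To list which tasks are affected, the master log can be scanned for the "Cannot kill task" warning quoted above. The following is a minimal sketch, not part of Mesos: the function name `stuck_tasks` and the regular expression are assumptions modeled only on the single warning line in this report, and other Mesos versions may word the message differently.

```python
import re

# Pattern modeled on the one warning line quoted in this report (assumption:
# the message wording is stable across the log being scanned).
CANNOT_KILL = re.compile(
    r"Cannot kill task (?P<task>\S+) of framework (?P<framework>\S+) .*?"
    r"because the agent (?P<agent>\S+) at (?P<pid>\S+) .*?is disconnected"
)


def stuck_tasks(lines):
    """Return (task_id, framework_id, agent_id) for each matching warning line."""
    found = []
    for line in lines:
        match = CANNOT_KILL.search(line)
        if match:
            found.append(
                (match.group("task"), match.group("framework"), match.group("agent"))
            )
    return found
```

Passing the lines of a master log file to `stuck_tasks` yields one tuple per kill attempt that was deferred because the agent was disconnected; lines that do not match, such as the "Processing KILL call" entry, are ignored.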

People

Assignee: Vinod Kone

Reporter: Cosmin Lehene

Votes: 0

Watchers: 1

Dates

Created: 18/Oct/17 17:05

Updated: 14/Nov/17 00:01

Resolved: 14/Nov/17 00:01