[MESOS-4659] Avoid leaving orphan task after framework failure + master failover - ASF JIRA

Details

Type: Bug
Status: Accepted
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: master
Labels:
- failover
- mesosphere

Description

If a framework becomes disconnected from the master, its tasks are killed once failover_timeout elapses without the framework reconnecting.

However, if a master failover occurs and a framework never reconnects to the new master, we never kill any of the tasks associated with that framework. These tasks remain orphaned and would presumably need to be removed manually by the operator. Similarly, if a framework is torn down or disconnects while it has running tasks on a partitioned agent, those tasks are not shut down when the agent re-registers.

We should consider whether to kill such orphaned tasks automatically, likely after waiting for some (framework-configurable?) timeout.
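
One possible shape for such a policy, as a minimal Python sketch rather than actual Mesos master code: after a failover, treat a task as orphaned once its framework has failed to re-register within a timeout. Every name here (`Task`, `orphans_to_kill`, `timeouts`, `default_timeout_secs`) is hypothetical, and the sketch assumes the new master can learn a per-framework timeout at all, which may depend on persisting framework information (see MESOS-1719).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a task reported by a re-registering agent."""
    task_id: str
    framework_id: str


def orphans_to_kill(
    tasks: Iterable[Task],
    registered_frameworks: Set[str],
    elapsed_secs: float,
    timeouts: Dict[str, float],
    default_timeout_secs: float,
) -> List[Task]:
    """Return tasks whose framework has not re-registered in time.

    elapsed_secs is the time since the new master was elected.
    timeouts maps a framework id to its (assumed) configurable timeout;
    frameworks without an entry fall back to default_timeout_secs.
    """
    to_kill = []
    for task in tasks:
        if task.framework_id in registered_frameworks:
            # The framework came back, so its tasks are not orphans.
            continue
        timeout = timeouts.get(task.framework_id, default_timeout_secs)
        if elapsed_secs >= timeout:
            to_kill.append(task)
    return to_kill


# Example state after a failover: only fw-a has re-registered.
tasks = [Task("t1", "fw-a"), Task("t2", "fw-b"), Task("t3", "fw-c")]
registered = {"fw-a"}
timeouts = {"fw-b": 60.0}  # fw-c has no entry and uses the default
```

With a default of 600 seconds, calling `orphans_to_kill(tasks, registered, 120.0, timeouts, 600.0)` would select only `t2`; `t3` would be selected once 600 seconds have passed, and `t1` never, because its framework reconnected.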

Issue Links

is duplicated by

MESOS-5378 Terminating a framework during master failover leads to orphaned tasks

Resolved

MESOS-5761 Improve the logic of orphan tasks

Resolved

is related to

MESOS-1719 Master should persist framework information

Accepted

relates to

MESOS-6419 The 'master/teardown' endpoint should support tearing down 'unregistered_frameworks'.

Resolved

MESOS-6602 Shutdown completed frameworks when unreachable agent re-registers

Resolved

MESOS-6136 Duplicate framework id handling

Open

People

Assignee: Unassigned

Reporter: Neil Conway

Votes: 0

Watchers: 10

Dates

Created: 11/Feb/16 19:46

Updated: 17/Feb/20 20:01