Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
An executor's initial tasks may be killed even after it has been registered. In that case, the executor could linger forever.
In MESOS-8411, we added a short-term fix that checks an executor's completed and terminated task queues to see whether it has ever received any tasks. If the check is false and there are no queued or launched tasks, the agent shuts down the executor.
However, this check is not bulletproof. The completedTasks queue is a circular_buffer (current size 200), which means that earlier completed tasks, which may have been updated by the executor, can be evicted and are then missed by this check. This would lead to false-positive shutdowns.
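The failure mode can be illustrated in miniature. The sketch below is a simplified Python model, not Mesos code: `Task`, `Executor`, `has_ever_received_task`, and `should_shutdown` are hypothetical names, a `deque` with `maxlen` stands in for the circular_buffer, and the buffer size is shrunk from 200 to 3. It assumes the check looks for a task that carries a status update from the executor.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    # True if a status update for this task came from the executor,
    # i.e. the executor actually received the task (assumed semantics).
    updated_by_executor: bool


class Executor:
    def __init__(self, buffer_size=3):  # stand-in for the real size of 200
        self.queued = []
        self.launched = []
        self.terminated = []
        # A deque with maxlen silently drops its oldest entry when full,
        # which mimics a circular buffer.
        self.completed = deque(maxlen=buffer_size)

    def has_ever_received_task(self):
        # Inference from the bounded task history only.
        history = list(self.terminated) + list(self.completed)
        return any(t.updated_by_executor for t in history)

    def should_shutdown(self):
        return (not self.queued and not self.launched
                and not self.has_ever_received_task())


executor = Executor()

# The executor runs one task to completion.
executor.completed.append(Task("t0", updated_by_executor=True))
print(executor.should_shutdown())  # False

# Three later tasks are killed before they reach the executor,
# which pushes t0 out of the bounded buffer.
for i in range(1, 4):
    executor.completed.append(Task(f"t{i}", updated_by_executor=False))

print(executor.should_shutdown())  # True, although t0 was received
```

Once the only task with an executor-sourced update has been evicted, the model can no longer distinguish this executor from one that never received a task.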
Per discussion with vinodkone and bmahler, there are two long-term solutions.
The first is to checkpoint additional executor state that indicates whether the executor has ever received any tasks (no more inference from task queue status).
The alternative is to add timeouts in the executor driver to shut down lingering executors automatically.
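Both directions can be sketched roughly as follows. This is an illustrative Python model under stated assumptions, not a proposed Mesos patch: `CheckpointedExecutorState`, its JSON file layout, and `idle_timeout_expired` are hypothetical, and the timeout semantics (no task received within a fixed period after registration) are only one possible reading of the second option.

```python
import json
import os
import tempfile


class CheckpointedExecutorState:
    """Option 1 (hypothetical): persist an explicit flag instead of
    inferring it from task queues."""

    def __init__(self, path):
        self.path = path
        self.ever_received_task = False
        # Recover the flag if a checkpoint already exists.
        if os.path.exists(path):
            with open(path) as f:
                self.ever_received_task = json.load(f)["ever_received_task"]

    def mark_task_received(self):
        self.ever_received_task = True
        with open(self.path, "w") as f:
            json.dump({"ever_received_task": True}, f)


def idle_timeout_expired(registered_at, first_task_at, now, timeout_secs):
    """Option 2 (hypothetical): the executor shuts itself down if it has
    received no task within timeout_secs of registering."""
    return first_task_at is None and now - registered_at >= timeout_secs


# Option 1: the flag survives a simulated agent restart.
path = os.path.join(tempfile.mkdtemp(), "executor.state")
state = CheckpointedExecutorState(path)
state.mark_task_received()
recovered = CheckpointedExecutorState(path)
print(recovered.ever_received_task)  # True

# Option 2: an idle executor times out; one that got a task does not.
print(idle_timeout_expired(0.0, None, 30.0, 10.0))  # True
print(idle_timeout_expired(0.0, 5.0, 30.0, 10.0))   # False
```

The flag removes the dependency on a bounded task history, while the timeout moves the decision into the executor so that the agent does not have to infer anything.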
Issue Links
- relates to MESOS-8411: Killing a queued task can lead to the command executor never terminating. (Resolved)