[MESOS-4106] The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.25.0
Fix Version/s: 0.24.2, 0.25.1, 0.26.0
Component/s: None
Labels: None

Description

This was reported by tan experimenting with health checks. Many tasks were launched with the following health check, taken from the container stdout/stderr:

Launching health check process: /usr/local/libexec/mesos/mesos-health-check --executor=(1)@127.0.0.1:39629 --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} --task_id=sleepy-2

This should have led to all tasks getting killed, since consecutive_failures is set to 1 and the command (false) always fails. However, only some tasks were killed, while others remained running.
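To make the expected behavior concrete, here is a minimal sketch of the consecutive-failure counting implied by the configuration above. It is not the Mesos implementation: the function name checks_until_kill is hypothetical, and the sketch deliberately ignores delay_seconds, grace_period_seconds, interval_seconds and timeout_seconds. It only shows that a check which always fails should trigger a kill on the first failure when consecutive_failures is 1.

```python
import json

# Health check configuration copied from the log line above.
HEALTH_CHECK_JSON = (
    '{"command":{"shell":true,"value":"false"},"consecutive_failures":1,'
    '"delay_seconds":0.0,"grace_period_seconds":1.0,'
    '"interval_seconds":1.0,"timeout_seconds":1.0}'
)


def checks_until_kill(results, consecutive_failures):
    """Return the 1-based index of the check that should trigger a kill.

    `results` is a sequence of booleans (True = healthy). Returns None if
    the failure streak never reaches `consecutive_failures`.
    Hypothetical helper; timing-related fields are not modeled.
    """
    streak = 0
    for index, healthy in enumerate(results, start=1):
        # A healthy result resets the streak; a failure extends it.
        streak = 0 if healthy else streak + 1
        if streak >= consecutive_failures:
            return index
    return None


config = json.loads(HEALTH_CHECK_JSON)

# The shell command `false` always exits non-zero, so every check fails.
always_failing = [False] * 5
print(checks_until_kill(always_failing, config["consecutive_failures"]))
```

Under these assumptions every task should be flagged for killing after its first check, so a task that keeps running points at the notification step rather than at the counting.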

It turns out that the health check binary does a send and promptly exits. Unfortunately, this may lead to a message drop since libprocess may not have sent this message over the socket by the time the process exits.

We work around this in the command executor with a manual sleep, which has been around since the svn days. See here.
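The send-then-exit race can be illustrated with a toy model. This is a sketch under stated assumptions, not libprocess code: the AsyncSender class, its method names, and the message string are all hypothetical. The only property it borrows from the description is that send returns before any I/O happens, so a process that exits immediately afterwards can lose the message, whereas giving the I/O layer a chance to run first (which a sleep approximates) delivers it.

```python
from collections import deque


class AsyncSender:
    """Toy stand-in for an asynchronous messaging layer (hypothetical)."""

    def __init__(self):
        self.pending = deque()  # messages queued but not yet written
        self.delivered = []     # messages that reached the "socket"

    def send(self, message):
        # Returns immediately; the actual write happens later in run_io().
        self.pending.append(message)

    def run_io(self):
        # Models the I/O loop getting a chance to write queued messages.
        while self.pending:
            self.delivered.append(self.pending.popleft())

    def exit(self, flush):
        """Simulate process exit; return the messages that were dropped."""
        if flush:
            self.run_io()
        dropped = list(self.pending)
        self.pending.clear()
        return dropped


# Send and promptly exit: the message never reaches the socket.
hasty = AsyncSender()
hasty.send("task sleepy-2 is unhealthy")
print(hasty.exit(flush=False))

# Let pending I/O complete before exiting: nothing is dropped.
careful = AsyncSender()
careful.send("task sleepy-2 is unhealthy")
print(careful.exit(flush=True), careful.delivered)
```

In the real system the outcome depends on timing, which would explain why only some tasks were left running. The model makes the two outcomes deterministic so they can be compared side by side.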

Issue Links

relates to:

MESOS-1613 HealthCheckTest.ConsecutiveFailures is flaky (Resolved)

MESOS-4111 Provide a means for libprocess users to exit while ensuring messages are flushed. (Accepted)

People

Assignee: Benjamin Mahler

Reporter: Benjamin Mahler

Votes: 0

Watchers: 5

Dates

Created: 09/Dec/15 23:44

Updated: 27/Feb/16 00:44

Resolved: 10/Dec/15 02:50