Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-4106

The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.25.0
    • Fix Version/s: 0.24.2, 0.25.1, 0.26.0
    • Component/s: None
    • Labels:
      None

      Description

      This was reported by Tyler Neely experimenting with health checks. Many tasks were launched with the following health check, taken from the container stdout/stderr:

      Launching health check process: /usr/local/libexec/mesos/mesos-health-check --executor=(1)@127.0.0.1:39629 --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} --task_id=sleepy-2
      

      This should have led to all tasks getting killed due to --consecutive_failures being set, however, only some tasks get killed, while other remain running.

      It turns out that the health check binary does a send and promptly exits. Unfortunately, this may lead to a message drop since libprocess may not have sent this message over the socket by the time the process exits.

      We work around this in the command executor with a manual sleep, which has been around since the svn days. See here.

        Issue Links

          Activity

          Hide
          bmahler Benjamin Mahler added a comment -

          This is also possibly the reason for MESOS-1613.

          Show
          bmahler Benjamin Mahler added a comment - This is also possibly the reason for MESOS-1613 .
          Hide
          haosdent@gmail.com haosdent added a comment -

          Does add ' os::sleep(Seconds(1));' enough?

          Show
          haosdent@gmail.com haosdent added a comment - Does add ' os::sleep(Seconds(1));' enough?
          Hide
          bmahler Benjamin Mahler added a comment -

          haosdent: From my testing so far, yes. I will send a fix and re-enable the test from MESOS-1613.

          Show
          bmahler Benjamin Mahler added a comment - haosdent : From my testing so far, yes. I will send a fix and re-enable the test from MESOS-1613 .
          Show
          bmahler Benjamin Mahler added a comment - https://reviews.apache.org/r/41178/
          Hide
          neilc Neil Conway added a comment -

          To fix the problem properly (without a sleep hack), it seems we need something akin to a flush primitive in libprocess. For example, we could provide a variant of send that returns a future, where the future is only satisfied once the associated message has been delivered to the kernel.

          Show
          neilc Neil Conway added a comment - To fix the problem properly (without a sleep hack), it seems we need something akin to a flush primitive in libprocess. For example, we could provide a variant of send that returns a future, where the future is only satisfied once the associated message has been delivered to the kernel.
          Hide
          bmahler Benjamin Mahler added a comment -

          Yeah I had the same thought when I was looking at MESOS-243, but now we also have process::finalize that could be the mechanism for cleanly shutting down before exit calls. I'll file a ticket to express this issue more generally (MESOS-243 was the original but is specific to the executor driver).

          Show
          bmahler Benjamin Mahler added a comment - Yeah I had the same thought when I was looking at MESOS-243 , but now we also have process::finalize that could be the mechanism for cleanly shutting down before exit calls. I'll file a ticket to express this issue more generally ( MESOS-243 was the original but is specific to the executor driver).
          Hide
          neilc Neil Conway added a comment -

          Sounds good – maybe add a TODO to the fix for this bug to put in a more robust fix later?

          Show
          neilc Neil Conway added a comment - Sounds good – maybe add a TODO to the fix for this bug to put in a more robust fix later?
          Hide
          bmahler Benjamin Mahler added a comment -

          Yeah, I'll reference MESOS-4111 now that we have it, I'll also reference it in the existing command executor sleep.

          Show
          bmahler Benjamin Mahler added a comment - Yeah, I'll reference MESOS-4111 now that we have it, I'll also reference it in the existing command executor sleep.
          Hide
          bmahler Benjamin Mahler added a comment -

          Marking as fixed in 0.27.0 for now, but would be prudent to pull it in to the ongoing 0.26.0 release:

          commit 7aa7957fb9a5bce5a9d8ae5a2560d1bde97d1274
          Author: Benjamin Mahler <benjamin.mahler@gmail.com>
          Date:   Wed Dec 9 17:43:27 2015 -0800
          
              Fixed a message dropping bug in the health checker.
          
              Much like in the command executor, we need to sleep after we send
              the final message in the health checker. Otherwise, we may exit
              before libprocess is able to finish sending the message over the
              local network.
          
              This led to the following issues:
              https://issues.apache.org/jira/browse/MESOS-1613
              https://issues.apache.org/jira/browse/MESOS-4106
          
              Review: https://reviews.apache.org/r/41178
          
          Show
          bmahler Benjamin Mahler added a comment - Marking as fixed in 0.27.0 for now, but would be prudent to pull it in to the ongoing 0.26.0 release: commit 7aa7957fb9a5bce5a9d8ae5a2560d1bde97d1274 Author: Benjamin Mahler <benjamin.mahler@gmail.com> Date: Wed Dec 9 17:43:27 2015 -0800 Fixed a message dropping bug in the health checker. Much like in the command executor, we need to sleep after we send the final message in the health checker. Otherwise, we may exit before libprocess is able to finish sending the message over the local network. This led to the following issues: https://issues.apache.org/jira/browse/MESOS-1613 https://issues.apache.org/jira/browse/MESOS-4106 Review: https://reviews.apache.org/r/41178
          Hide
          bbannier Benjamin Bannier added a comment - - edited

          Late to the party as this already went in.

          Just sleeping here to have the message out is a very weak guarantee (it does not guarantee that the message was actually sent). What one should probably do instead to make this robust is block until a state change in executor happens (with a timeout), e.g., observe change of state of taskID via querying the executor.

          Show
          bbannier Benjamin Bannier added a comment - - edited Late to the party as this already went in. Just sleeping here to have the message out is a very weak guarantee (it does not guarantee that the message was actually sent). What one should probably do instead to make this robust is block until a state change in executor happens (with a timeout), e.g., observe change of state of taskID via querying the executor .
          Hide
          tnachen Timothy Chen added a comment -

          Thanks for fixing my bug!

          Show
          tnachen Timothy Chen added a comment - Thanks for fixing my bug!
          Hide
          bmahler Benjamin Mahler added a comment -

          I'm not sure we should say sleeping provides a "very weak guarantee", there is indeed no guarantee with a sleep that the message is sent.

          The approach you've suggested with querying with a timeout still provides no form of guarantee, unless you are going to wait indefinitely or use the timeout mentioned to trigger a retry rather than an exit (what did you intend to happen after the timeout?). This approach is guaranteeing application-level delivery, and we generally just use an "acknowledgement" message with retries to do this, rather than a separate query.

          However, since the executor resides on the same machine, and executor failover is not supported, we're unlikely to bother implementing acknowledgements with retries here. We only need to wait for the data to be sent on the socket (this gives a "weak guarantee": e.g. if there are no socket errors (note that both ends of the socket are within the same machine), and the executor remains up, the message will eventually be processed by the executor). MESOS-4111 discusses the general issue of being able to exit after ensuring that messages are processed in libprocess.

          In the case of the long-standing command executor sleep, we needed to handle agent failure. So we are already using acknowledgements there, and can use them to stop() cleanly.

          Show
          bmahler Benjamin Mahler added a comment - I'm not sure we should say sleeping provides a "very weak guarantee", there is indeed no guarantee with a sleep that the message is sent. The approach you've suggested with querying with a timeout still provides no form of guarantee, unless you are going to wait indefinitely or use the timeout mentioned to trigger a retry rather than an exit (what did you intend to happen after the timeout?). This approach is guaranteeing application-level delivery, and we generally just use an "acknowledgement" message with retries to do this, rather than a separate query. However, since the executor resides on the same machine, and executor failover is not supported, we're unlikely to bother implementing acknowledgements with retries here. We only need to wait for the data to be sent on the socket (this gives a "weak guarantee": e.g. if there are no socket errors (note that both ends of the socket are within the same machine), and the executor remains up, the message will eventually be processed by the executor). MESOS-4111 discusses the general issue of being able to exit after ensuring that messages are processed in libprocess. In the case of the long-standing command executor sleep, we needed to handle agent failure. So we are already using acknowledgements there, and can use them to stop() cleanly.

            People

            • Assignee:
              bmahler Benjamin Mahler
              Reporter:
              bmahler Benjamin Mahler
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development