  1. Mesos
  2. MESOS-4106

The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.25.0
    • Fix Version/s: 0.24.2, 0.25.1, 0.26.0
    • Component/s: None
    • Labels: None

    Description

      This was reported by tan while experimenting with health checks. Many tasks were launched with the following health check, as shown in the container stdout/stderr:

      Launching health check process: /usr/local/libexec/mesos/mesos-health-check --executor=(1)@127.0.0.1:39629 --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} --task_id=sleepy-2
      

      This should have led to all tasks getting killed, since consecutive_failures is set to 1 and the check command (false) always fails. However, only some tasks were killed, while others remained running.
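      To make the expected behavior concrete, here is a minimal sketch of a consecutive-failure counter. It is not the mesos-health-check implementation: the function name first_kill_index is hypothetical, the grace period and timing fields are ignored, and only the JSON string is taken from the log line above.

```python
import json

# Health check definition copied from the launch log above.
HEALTH_CHECK_JSON = (
    '{"command":{"shell":true,"value":"false"},"consecutive_failures":1,'
    '"delay_seconds":0.0,"grace_period_seconds":1.0,'
    '"interval_seconds":1.0,"timeout_seconds":1.0}'
)


def first_kill_index(results, consecutive_failures):
    """Return the 0-based index of the check that should trigger a kill.

    `results` is a sequence of booleans (True means the check passed).
    Returns None if the failure streak never reaches the threshold.
    Hypothetical helper for illustration only.
    """
    streak = 0
    for index, healthy in enumerate(results):
        # A passing check resets the streak; a failing one extends it.
        streak = 0 if healthy else streak + 1
        if streak >= consecutive_failures:
            return index
    return None


spec = json.loads(HEALTH_CHECK_JSON)
threshold = spec["consecutive_failures"]

# The command `false` always exits non-zero, so every check fails.
always_failing = [False] * 5
kill_index = first_kill_index(always_failing, threshold)
```

      With a threshold of 1 and a command that always fails, the very first check should request a kill for every task, which is why tasks that kept running point to a lost message rather than a counting problem.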

      It turns out that the health check binary does a send and promptly exits. Unfortunately, this may lead to a message drop since libprocess may not have sent this message over the socket by the time the process exits.

      We work around this in the command executor with a manual sleep, which has been around since the svn days.
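      The send-then-exit race can be illustrated with a small model. This is a sketch under stated assumptions, not libprocess code: AsyncSender is a hypothetical stand-in for an asynchronous messaging layer, the latency value is made up, and the message string is a placeholder. Whether libprocess exposes an equivalent of flush() is not stated in this issue.

```python
import queue
import threading
import time


class AsyncSender:
    """Hypothetical stand-in for an asynchronous messaging layer."""

    def __init__(self, latency=0.05):
        self.delivered = []
        self._latency = latency
        self._outbox = queue.Queue()
        # Daemon thread: it is abandoned when the process exits, which is
        # how a queued but unsent message gets dropped in this model.
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def _drain(self):
        while True:
            message = self._outbox.get()
            time.sleep(self._latency)  # simulated delay before the socket write
            self.delivered.append(message)
            self._outbox.task_done()

    def send(self, message):
        # Returns immediately; the message is only queued at this point.
        self._outbox.put(message)

    def flush(self):
        # Block until every queued message has been handed off.
        self._outbox.join()


sender = AsyncSender()
sender.send("kill task sleepy-2")  # placeholder message
# Exiting here would abandon the worker thread and likely lose the message.
sender.flush()
# Only after the flush is it safe to exit.
```

      A fixed sleep, like the command executor workaround, only makes the drop less likely because it guesses how long delivery takes. Waiting on an explicit completion signal, as flush() does in this model, waits exactly as long as needed.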

            People

              Assignee: Benjamin Mahler (bmahler)
              Reporter: Benjamin Mahler (bmahler)
              Votes: 0
              Watchers: 5
