Aurora / AURORA-1942

Improve Aurora behavior with regard to Mesos Agents violating reregistration timeouts

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • None
    • Component/s: Scheduler
    • None

    Description

      A Mesos Agent Lost message can be received in two scenarios resulting in different outcomes:

      1) A Mesos Agent can fail the health check performed by the Mesos Master (a max_agent_ping_timeouts violation), which leads to an Agent Lost message along with a TASK_LOST message for each task running on the unhealthy Agent.

      2) A Mesos Agent can fail to re-register after an election has taken place (an agent_reregister_timeout violation). In this situation the newly elected Mesos Master cannot send TASK_LOST messages for the tasks that were running on the Agent that failed to re-register, because Masters do not store any information about the tasks that are currently running.

      Scenario 2 can lead to (a) "missing" instances for the tasks scheduled on the rogue Agent, which persist until an explicit reconciliation is done, and/or (b) "leaked" tasks if the Agent re-registers after Aurora has replaced the missing tasks. Leaked tasks are cleaned up only upon an implicit reconciliation.
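
      The difference between the two reconciliation modes can be illustrated with a simplified model. This is only a sketch: the function names and the dictionary-based "views" are hypothetical and are not Aurora or Mesos APIs. It assumes that an explicit reconciliation returns TASK_LOST for any task the Master does not know about, and that an implicit reconciliation returns the state of every task the Master currently knows about.

```python
# Simplified, hypothetical model of Mesos task reconciliation.
# Views are plain dicts of task_id -> state; none of these names are real APIs.

def explicit_reconciliation(scheduler_tasks, master_tasks):
    """Scheduler names the tasks it knows; the master answers for each one.

    Assumption: a task unknown to the master is reported as TASK_LOST.
    """
    return {
        task_id: master_tasks.get(task_id, "TASK_LOST")
        for task_id in scheduler_tasks
    }


def implicit_reconciliation(master_tasks):
    """Scheduler sends no task list; the master reports every task it knows."""
    return dict(master_tasks)


def find_missing(scheduler_tasks, master_tasks):
    """Case (a): tasks the scheduler thinks are active but the master lost."""
    statuses = explicit_reconciliation(scheduler_tasks, master_tasks)
    return sorted(t for t, s in statuses.items() if s == "TASK_LOST")


def find_leaked(scheduler_tasks, master_tasks):
    """Case (b): tasks the master reports that the scheduler no longer tracks."""
    statuses = implicit_reconciliation(master_tasks)
    return sorted(t for t in statuses if t not in scheduler_tasks)


# (a) The agent never re-registered: the new master has no record of task-1.
missing = find_missing({"task-1": "TASK_RUNNING"}, {})

# (b) Aurora replaced task-1 with task-2, then the agent came back with task-1.
leaked = find_leaked(
    {"task-2": "TASK_RUNNING"},
    {"task-1": "TASK_RUNNING", "task-2": "TASK_RUNNING"},
)
```

      In this model, `missing` contains `task-1` and `leaked` contains `task-1`, matching cases (a) and (b) respectively.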

      For (a), one solution is to transition the tasks on a missing Agent to the LOST state upon receiving an Agent Lost message.
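
      A minimal sketch of the proposed handling for (a) follows. It is not Aurora's implementation: `Task`, `ACTIVE_STATES`, and `on_agent_lost` are hypothetical names, and the set of active states is an illustrative subset chosen for the example.

```python
from dataclasses import dataclass

# Illustrative subset of states in which a task still occupies an agent.
ACTIVE_STATES = {"ASSIGNED", "STARTING", "RUNNING"}


@dataclass
class Task:
    task_id: str
    agent_id: str
    status: str


def on_agent_lost(tasks, lost_agent_id):
    """Mark every active task on the lost agent as LOST.

    Returns the ids of the tasks that were transitioned. Tasks that are
    already terminal (including LOST) are left untouched.
    """
    affected = []
    for task in tasks:
        if task.agent_id == lost_agent_id and task.status in ACTIVE_STATES:
            task.status = "LOST"
            affected.append(task.task_id)
    return affected


tasks = [
    Task("task-1", "agent-A", "RUNNING"),
    Task("task-2", "agent-B", "RUNNING"),
    Task("task-3", "agent-A", "FINISHED"),
]
lost = on_agent_lost(tasks, "agent-A")
again = on_agent_lost(tasks, "agent-A")  # second call changes nothing
```

      Because tasks already in LOST are skipped, the handler is safe to run in scenario 1 as well, where TASK_LOST messages arrive alongside the Agent Lost message.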

      People

        Assignee: Unassigned
        Reporter: Renan DelValle