Mesos / MESOS-4049

Allow user to control behavior of partitioned agents/tasks

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: agent, master

    Description

      At present, if an agent is partitioned away from the master, the master waits for a period of time (see MESOS-4048) before deciding that the agent is dead. It then marks the agent as lost, sends TASK_LOST messages for all the tasks running on the agent, and instructs the agent to shut down.
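      The current behavior described above can be sketched roughly as follows. This is only an illustration, not Mesos code: the `Agent` class, the `handle_missed_ping` function, the `max_missed_pings` parameter, and the action tuples are all hypothetical names, and the sketch assumes the waiting period is counted in missed pings.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    # Hypothetical stand-in for the master's view of an agent.
    agent_id: str
    tasks: list
    missed_pings: int = 0


def handle_missed_ping(agent, max_missed_pings):
    """Return the actions the master would take after one more missed ping.

    Assumption: the timeout is expressed as a count of missed pings.
    """
    agent.missed_pings += 1
    if agent.missed_pings < max_missed_pings:
        # Still within the waiting period: do nothing yet.
        return []

    # Timeout reached: mark the agent lost, report every task as
    # TASK_LOST, and tell the agent to shut down.
    actions = [("mark_agent_lost", agent.agent_id)]
    for task_id in agent.tasks:
        actions.append(("send_status_update", task_id, "TASK_LOST"))
    actions.append(("shutdown_agent", agent.agent_id))
    return actions
```

      Note that in this sketch the framework has no say in the outcome: every task on the agent gets the same treatment once the timeout expires.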

      Although this behavior suits many users, it is not ideal for everyone. For example:

      • Some users might want to aggressively start a new replacement task (e.g., after one or two ping timeouts are missed); then when the old copy of the task comes back, they might want to make an intelligent decision about how to reconcile this situation (e.g., kill old, kill new, allow both to continue running).
      • Some frameworks might want different behavior from other frameworks, or to treat some tasks differently from other tasks. For example, if a task has a huge amount of state that would need to be regenerated to spin up another instance, the user might want to wait longer before starting a new task to increase the chance that the old task will reappear.
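      The two cases above amount to a per-task policy chosen by the framework. A minimal sketch of what such a policy could look like is shown here; `TaskPolicy`, `Reconcile`, `should_start_replacement`, and `on_old_task_returned` are hypothetical names for illustration and do not correspond to any existing Mesos API.

```python
from dataclasses import dataclass
from enum import Enum


class Reconcile(Enum):
    # The three outcomes mentioned above for when the old task reappears.
    KILL_OLD = "kill_old"
    KILL_NEW = "kill_new"
    KEEP_BOTH = "keep_both"


@dataclass
class TaskPolicy:
    # Hypothetical per-task settings chosen by the framework.
    replace_after_secs: float  # how long to wait before launching a replacement
    on_return: Reconcile       # what to do if the old copy comes back


def should_start_replacement(policy, unreachable_secs):
    """True once the task has been unreachable for long enough."""
    return unreachable_secs >= policy.replace_after_secs


def on_old_task_returned(policy, replacement_running):
    """Decide how to reconcile when the partitioned task reappears."""
    if not replacement_running:
        # No replacement was started, so there is nothing to reconcile.
        return None
    return policy.on_return
```

      For instance, a framework could give a stateless task a short `replace_after_secs` with `Reconcile.KILL_OLD`, and a task with a large amount of state a much longer wait so that the old copy has a better chance of reappearing.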

      To support this, we'd need to change the task state machine so that a task can go from RUNNING to a new state (say UNKNOWN or WANDERING), and then from that state back to RUNNING. We could also keep the current "mark-lost-after-timeout" behavior as an option, in which case UNKNOWN could also transition to LOST. The agent would also keep its old slaveId when it reconnects.
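      The proposed transitions can be written down as a small state table. This is a sketch under assumptions: it uses UNKNOWN as the name of the new state (the name is still open above), it models only the partition-related transitions rather than the full task lifecycle, and `ALLOWED` and `transition` are hypothetical names.

```python
# Partition-related transitions only; other task states are omitted.
ALLOWED = {
    "RUNNING": {"UNKNOWN"},          # agent becomes partitioned
    "UNKNOWN": {"RUNNING", "LOST"},  # agent reconnects, or optional timeout
    "LOST": set(),                   # terminal in this sketch
}


def transition(current, new):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

      The key difference from the current behavior is the UNKNOWN to RUNNING edge: a task on a partitioned agent is no longer forced into a terminal state.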

            People

              Assignee: Neil Conway
              Reporter: Neil Conway
              Shepherd: Vinod Kone
              Votes: 1
              Watchers: 10
