Mesos / MESOS-783

Master::killTask must not answer with TASK_LOST when the task is unknown.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.14.0, 0.14.1, 0.14.2, 0.15.0
    • Fix Version/s: 0.19.0
    • None

    Description

      When the Master is asked to kill a task and it knows of the framework but it cannot locate the TaskID, the Master replies with TASK_LOST.

      This is normally fine. However, consider a failed-over Master:
      --> Master fails over.
      --> Framework F re-registers.
      --> Slave with Task T in TASK_RUNNING has not yet re-registered.
      --> Master::killTask(F, T) cannot find T and replies with TASK_LOST.
      --> Slave re-registers with Task T in TASK_RUNNING.
      --> Now we've told the framework the task was LOST but it is left RUNNING.
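The sequence above can be replayed with a small toy model. This is only a sketch: `ToyMaster`, `KillReply`, and `frameworkViewDiverges` are hypothetical names invented for illustration, not the real Mesos classes, and the handling of an unknown framework is an assumption.

```cpp
#include <map>
#include <set>
#include <string>

// Toy model for illustration only; these are NOT the real Mesos classes.
enum class TaskState { TASK_RUNNING, TASK_LOST };
enum class KillReply { NONE, TASK_LOST, FORWARDED };

struct ToyMaster {
  std::set<std::string> frameworks;        // frameworks that have re-registered
  std::map<std::string, TaskState> tasks;  // tasks reported by re-registered slaves

  // Behaviour described in this report: a known framework asking about an
  // unknown task is answered with TASK_LOST.
  KillReply killTask(const std::string& framework,
                     const std::string& task) const {
    if (frameworks.count(framework) == 0) return KillReply::NONE;  // assumption
    if (tasks.count(task) == 0) return KillReply::TASK_LOST;
    return KillReply::FORWARDED;  // known task: the kill would go to the slave
  }

  // A re-registering slave reports the tasks it is still running.
  void reregisterSlave(const std::map<std::string, TaskState>& slaveTasks) {
    for (const auto& entry : slaveTasks) tasks[entry.first] = entry.second;
  }
};

// Replays the failover sequence. Returns true when the framework was told
// TASK_LOST although the master later learns that the task is TASK_RUNNING.
bool frameworkViewDiverges() {
  ToyMaster master;               // fresh state after failover
  master.frameworks.insert("F");  // framework F re-registers first
  const KillReply reply = master.killTask("F", "T");  // slave not yet back
  master.reregisterSlave({{"T", TaskState::TASK_RUNNING}});
  return reply == KillReply::TASK_LOST &&
         master.tasks.at("T") == TaskState::TASK_RUNNING;
}
```

In this model `frameworkViewDiverges()` returns true, which is the inconsistency the report describes: the framework's view says LOST while the task is still RUNNING.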

      The simple fix is to not reply in such cases and to rely on a later reconciliation request.

      With a stateful master (MESOS-764), we can reliably reply with TASK_LOST if the slave is not in the Registrar. Otherwise we must remain silent, because the slave may still re-register with the correct state of the task. Ideally we would hold the kill task message and send it to the slave once it re-registers, but this is somewhat complicated to implement, and reconciliation can cover this case.
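The decision rule proposed here could look roughly like the following. It is a minimal sketch under stated assumptions, not the actual fix: `MasterView`, `KillDecision`, and `decideKill` are hypothetical names, and the sketch assumes the master can tell which slave the task belongs to, which the report does not specify.

```cpp
#include <set>
#include <string>

// Hypothetical types for illustration; not the real Mesos API.
enum class KillDecision { FORWARD_TO_SLAVE, REPLY_TASK_LOST, STAY_SILENT };

struct MasterView {
  std::set<std::string> knownTasks;      // tasks from slaves that re-registered
  std::set<std::string> registrarSlaves; // slaves recorded in the Registrar
};

// Assumption: the slave ID for the task is available to the master.
KillDecision decideKill(const MasterView& master,
                        const std::string& taskId,
                        const std::string& slaveId) {
  if (master.knownTasks.count(taskId) > 0) {
    return KillDecision::FORWARD_TO_SLAVE;  // normal path: task is known
  }
  if (master.registrarSlaves.count(slaveId) == 0) {
    // The slave is not in the Registrar, so it cannot re-register with this
    // task; replying TASK_LOST is safe.
    return KillDecision::REPLY_TASK_LOST;
  }
  // The slave may still re-register with the task in TASK_RUNNING, so say
  // nothing and leave the outcome to a later reconciliation request.
  return KillDecision::STAY_SILENT;
}
```

Without a stateful master, the simple fix corresponds to always taking the `STAY_SILENT` branch for unknown tasks.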

            People

              bmahler Benjamin Mahler
              bmahler Benjamin Mahler