MESOS-305: Inform the frameworks / slaves about a master failover

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels:
      None

      Description

      With the recent changes in the master detector code, we no longer send 'NoMasterDetected' to the scheduler driver, which in turn means the 'disconnected' scheduler callback is never invoked.

      At Twitter this manifested as a spew of LOST tasks whenever a master failover happens. The scheduler holds on to offers for a while and does not learn that they are invalid until after it has launched tasks against them. Though this is a race, we should minimize the window as much as possible by informing the scheduler of the master failover.
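
To illustrate why the 'disconnected' callback matters, here is a minimal standalone sketch of a scheduler that caches offers and drops them when told about a disconnect. It does not use the Mesos bindings: the class `OfferCachingScheduler`, the helper `launchable_offers`, and the simplified callback signatures are assumptions made for illustration. Only the callback names mirror the scheduler interface.

```python
class OfferCachingScheduler:
    """Hypothetical scheduler that holds offers before launching tasks.

    Callback names mirror the scheduler interface, but the signatures are
    simplified (no driver argument) so the sketch runs on its own.
    """

    def __init__(self):
        self.connected = False
        self.offers = {}  # offer id -> offer payload

    def registered(self, master_id):
        self.connected = True

    def reregistered(self, master_id):
        # Offers from the previous master were already discarded in
        # disconnected(); the new master will send fresh ones.
        self.connected = True

    def resourceOffers(self, offers):
        # Hold on to offers for a while instead of using them at once.
        self.offers.update(offers)

    def disconnected(self):
        # Cached offers were issued by the old master and are now invalid.
        # Launching tasks against them would only produce LOST tasks.
        self.connected = False
        self.offers.clear()

    def launchable_offers(self):
        # Hypothetical helper: offers that are still safe to launch against.
        return dict(self.offers) if self.connected else {}
```

If 'disconnected' is never invoked, the cache above is never cleared, which matches the behaviour described: stale offers are used and the resulting tasks come back as LOST.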

        Activity

        vinodkone Vinod Kone added a comment -

        This seems to be causing a slew of LOST tasks at Twitter whenever a master fails over.

        Benjamin Hindman, would you have some time to take a look at this and see if there is a short-term fix? IIUC, we were waiting on the leader detector refactor before fixing this.

        bmahler Benjamin Mahler added a comment -

        I'm working on a fix for this as we discussed offline.

        For transparency, we need to adjust the master detector to allow the messages. As a result, there need to be changes to the master as well, to ensure that after a network partition we disallow disconnected slaves from re-registering. This is because we've already informed frameworks of LOST tasks upon disconnecting the slave.
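
The re-registration rule described here can be sketched as follows. This is a toy model under stated assumptions, not the code from the linked reviews: `MasterSketch`, its fields, and its method names are all hypothetical, and the real master tracks far more state.

```python
class MasterSketch:
    """Hypothetical model of a master that refuses re-registration from
    slaves it has already disconnected."""

    def __init__(self):
        self.active = {}        # slave id -> set of task ids
        self.removed = set()    # slaves whose tasks were already reported LOST
        self.lost_updates = []  # (task id, state) updates sent to frameworks

    def register(self, slave_id, tasks=()):
        self.active[slave_id] = set(tasks)

    def disconnect(self, slave_id):
        # Frameworks are told right away that the slave's tasks are lost.
        for task_id in sorted(self.active.pop(slave_id, set())):
            self.lost_updates.append((task_id, "TASK_LOST"))
        self.removed.add(slave_id)

    def reregister(self, slave_id, tasks=()):
        # Accepting the slave back would contradict the LOST updates
        # already delivered, so the request is refused.
        if slave_id in self.removed:
            return False
        self.active[slave_id] = set(tasks)
        return True
```

For example, after `register("s1", ["t1"])` and `disconnect("s1")`, a later `reregister("s1", ["t1"])` is refused, so frameworks never see a task they were told was LOST reappear as running.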

        bmahler Benjamin Mahler added a comment -

        Chain of reviews:
        https://reviews.apache.org/r/10160/
        https://reviews.apache.org/r/10161/
        https://reviews.apache.org/r/10171/
        https://reviews.apache.org/r/10172/ (fix)

        bmahler Benjamin Mahler added a comment -

        This is now committed.


          People

          • Assignee:
            bmahler Benjamin Mahler
            Reporter:
            vinodkone Vinod Kone
          • Votes:
            0
            Watchers:
            4
