Aurora / AURORA-1911

HTTP Scheduler Driver does not reliably re-subscribe

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 0.18.0

    Description

      I observed this issue in a large production cluster during a period of Mesos master instability:
      1. Mesos master crashes or restarts.
      2. V1Mesos driver detects this and reconnects.
      3. Aurora does the SUBSCRIBE call again.
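      4. The SUBSCRIBE call fails (the master answers '503 Service Unavailable'), and the driver neither reports the failure nor retries.
      5. All subsequent calls are silently dropped by the driver, which remains in the CONNECTED state.
      6. Aurora receives no offers because it is not subscribed.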

      Logs:

      I0328 19:40:55.473546 101404 scheduler.cpp:353] Connected with the master at http://10.162.14.30:5050/master/api/v1/scheduler
      W0328 19:40:55.475898 101410 scheduler.cpp:583] Received '503 Service Unavailable' () for SUBSCRIBE
      ....
      W0328 19:40:58.862393 101398 scheduler.cpp:508] Dropping KILL: Scheduler is in state CONNECTED
      ....
      W0328 19:41:14.588474 101394 scheduler.cpp:508] Dropping KILL: Scheduler is in state CONNECTED
      ....
      W0328 19:41:37.763464 101402 scheduler.cpp:508] Dropping KILL: Scheduler is in state CONNECTED
      ...
      

      To fix this, the VersionedSchedulerDriver needs to do two things:
      1. Block calls when unsubscribed, not just when disconnected.
      2. Retry the SUBSCRIBE call repeatedly with exponential backoff.
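
      The two points above could be sketched roughly as follows. This is only an illustration, not the actual VersionedSchedulerDriver change: the class SubscribeGate, its method names, and the delay parameters are all hypothetical, and the SUBSCRIBE attempt is modeled as a BooleanSupplier that returns true on success.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: SubscribeGate is NOT part of Aurora or Mesos.
final class SubscribeGate {
  private boolean subscribed = false; // guarded by "this"

  // Delay before retry number `retry` (0-based): baseMs doubled `retry` times, capped at maxMs.
  static long backoffMs(long baseMs, long maxMs, int retry) {
    long delay = baseMs;
    for (int i = 0; i < retry && delay < maxMs; i++) {
      delay *= 2;
    }
    return Math.min(delay, maxMs);
  }

  // Point 2: repeat the SUBSCRIBE attempt until it succeeds, backing off between attempts.
  // Returns false only if the calling thread is interrupted while waiting.
  boolean subscribeWithRetry(BooleanSupplier sendSubscribe, long baseMs, long maxMs) {
    int retry = 0;
    while (!sendSubscribe.getAsBoolean()) {
      try {
        Thread.sleep(backoffMs(baseMs, maxMs, retry++));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    markSubscribed();
    return true;
  }

  private synchronized void markSubscribed() {
    subscribed = true;
    notifyAll(); // wake callers blocked in awaitSubscribed
  }

  // To be called whenever the connection to the master is lost.
  synchronized void markDisconnected() {
    subscribed = false;
  }

  // Point 1: callers of non-SUBSCRIBE calls (KILL, ACCEPT, ...) wait here instead of
  // having the call dropped. Returns false if not subscribed within timeoutMs.
  synchronized boolean awaitSubscribed(long timeoutMs) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!subscribed) {
      long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
      if (remainingMs <= 0) {
        return false;
      }
      try {
        wait(remainingMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    SubscribeGate gate = new SubscribeGate();
    int[] attempts = {0};
    // Simulated master: rejects the first two SUBSCRIBE attempts, then accepts.
    gate.subscribeWithRetry(() -> ++attempts[0] >= 3, 10, 1000);
    System.out.println("attempts=" + attempts[0]
        + " subscribed=" + gate.awaitSubscribed(100));
  }
}
```

      In this sketch a KILL issued while the scheduler is only CONNECTED would wait in awaitSubscribed until a SUBSCRIBE attempt succeeds, rather than being dropped. A real driver would probably also add jitter to the delay and bound the total wait; both are left out here.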

          People

            zmanji Zameer Manji