AURORA-1944

Aurora is unable to elect leader after losing ZK for an extended period of time

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: Scheduler
    • Labels: None
    • Environment: Running on 0.17.0

    Description

      When Apache Curator is used as the ZooKeeper client library, Aurora is unable to elect a leader if ZooKeeper loses quorum for an extended period of time.

      The scheduler appears to crash around the following log line:

      W0802 14:01:14.436 [TaskEventBatchWorker, SchedulerLifecycle] Failed to leave leadership: org.apache.aurora.common.zookeeper.SingletonService$LeaveException: Failed to abdicate leadership of group at /aurora/scheduler

      When the init system brings the scheduler back up, it is unable to elect a leader if ZK is still down.

      Specifically, the redirect monitor fails:

      E0802 14:09:37.063 [RedirectMonitor STARTING, GuavaUtils$LifecycleShutdownListener] Service: RedirectMonitor [FAILED] failed unexpectedly. Triggering shutdown.

      As a result, every scheduler logs the following:

      W0802 14:16:34.646 [qtp576711849-43, LeaderRedirect] No serviceGroupMonitor in host set, will not redirect despite not being leader.

      Once the scheduler enters this state, it does not recover until it is manually restarted.
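
To illustrate the failure mode described above, here is a minimal sketch of one common mitigation: retrying a component's startup with bounded backoff while the coordination service is unreachable, and re-raising once the attempts are exhausted so that the process exits cleanly instead of lingering in a half-started state. This is not Aurora code and not necessarily the fix that shipped in 0.19.0. `ServiceUnavailable`, `start_with_retry`, and `FlakyMonitor` are hypothetical names introduced only for this example.

```python
import time


class ServiceUnavailable(Exception):
    """Raised by the hypothetical monitor when the coordination service is down."""


def start_with_retry(start, max_attempts=5, base_delay=0.01, max_delay=0.1,
                     sleep=time.sleep):
    """Call start() until it succeeds, backing off exponentially between attempts.

    Returns the number of attempts used. Re-raises the last error if every
    attempt fails, so the caller can exit and let the init system restart it.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            start()
            return attempt
        except ServiceUnavailable:
            if attempt == max_attempts:
                raise
            sleep(delay)
            # Double the delay, but never exceed the configured ceiling.
            delay = min(delay * 2, max_delay)


class FlakyMonitor:
    """Hypothetical stand-in for a monitor that needs the service to start."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.running = False

    def start(self):
        # Simulate the service being unreachable for the first N attempts.
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ServiceUnavailable("quorum not available")
        self.running = True
```

Under these assumptions, a monitor that fails three times before the service returns would start on the fourth attempt, while a monitor that outlasts `max_attempts` surfaces the error to the caller rather than leaving the process running without a working monitor.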

      Attachments

        1. aurora-0.log
          3.28 MB
          Renan DelValle
        2. aurora-1.log
          3.26 MB
          Renan DelValle
        3. aurora-2.log
          3.29 MB
          Renan DelValle

          People

            Assignee: Unassigned
            Reporter: Renan DelValle