Uploaded image for project: 'Aurora'
  1. Aurora
  2. AURORA-897

aurora scheduler unable to restore from snapshot

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 0.5.0
    • 0.6.0
    • Scheduler
    • None

    Description

      Since this weekend, we are unable to upgrade scheduler. It fail as it couldn't restore replicated log from snapshot. See following traceback for details.

      I couldn't replicate it in vagrant that have >2 jobs. It happen in one of our test clusters, running with 1 scheduler. I could reproduce this problem, with restoring the snapshot manually using aurora_admin scheduler_stage_recovery command. The backup size is 1.5M, if it matter. I can share the snapshot if it'll help to debug.

      Scheduler works as of this commit: https://github.com/apache/incubator-aurora/commit/f2d0ca299ad759e9720a5a82547f7bce8a427d53
      Scheduler DOES NOT work as of this commit: https://github.com/apache/incubator-aurora/commit/3778330cd37bc8b96907654b496c15152dfecea7
      Commit delta: https://github.com/apache/incubator-aurora/compare/f2d0ca299ad759e9720a5a82547f7bce8a427d53...3778330cd37bc8b96907654b496c15152dfecea7

      I suspect one of commits related to AURORA-84 have caused this regression. The StorageBackfill class was modified to fix that jira.

      I1025 16:07:31.336 THREAD1 org.apache.aurora.scheduler.storage.log.LogStorage.replay: Applying snapshot taken on Sat Oct 25 16:05:29 UTC 2014
      I1025 16:07:31.339 THREAD1 org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl$9.execute: Restoring snapshot.
      I1025 16:07:31.891233 37970 leveldb.cpp:438] Reading position from leveldb took 254951ns
      I1025 16:07:31.893616 37976 leveldb.cpp:438] Reading position from leveldb took 110726ns
      I1025 16:07:31.909817 37971 leveldb.cpp:438] Reading position from leveldb took 186434ns
      I1025 16:07:31.913910 37971 leveldb.cpp:438] Reading position from leveldb took 81185ns
      I1025 16:07:31.917501 37983 leveldb.cpp:438] Reading position from leveldb took 31971ns
      I1025 16:07:31.919 THREAD1 org.apache.aurora.scheduler.storage.StorageBackfill.backfill: Backfilling task config job keys.
      I1025 16:07:32.023 THREAD1 org.apache.aurora.scheduler.storage.StorageBackfill.backfill: Performing shard uniqueness sanity check.
      E1025 16:07:32.076 THREAD1 org.apache.aurora.scheduler.SchedulerLifecycle$9.execute: Caught unchecked exception: java.lang.IllegalArgumentException
      java.lang.IllegalArgumentException
              at com.google.common.base.Preconditions.checkArgument(Preconditions.java:111)
              at org.apache.aurora.scheduler.base.JobKeys.assertValid(JobKeys.java:100)
              at org.apache.aurora.scheduler.base.Query$Builder.byInstances(Query.java:337)
              at org.apache.aurora.scheduler.base.Query.instanceScoped(Query.java:102)
              at org.apache.aurora.scheduler.storage.StorageBackfill.guaranteeShardUniqueness(StorageBackfill.java:74)
              at org.apache.aurora.scheduler.storage.StorageBackfill.access$100(StorageBackfill.java:44)
              at org.apache.aurora.scheduler.storage.StorageBackfill$2.apply(StorageBackfill.java:158)
              at org.apache.aurora.scheduler.storage.StorageBackfill$2.apply(StorageBackfill.java:151)
              at org.apache.aurora.scheduler.storage.mem.MemTaskStore.mutateTasks(MemTaskStore.java:194)
              at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
              at org.apache.aurora.scheduler.storage.log.WriteAheadStorage.mutateTasks(WriteAheadStorage.java:196)
              at org.apache.aurora.scheduler.storage.StorageBackfill.backfill(StorageBackfill.java:151)
              at org.apache.aurora.scheduler.SchedulerLifecycle$6$1.execute(SchedulerLifecycle.java:251)
              at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132)
              at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149)
              at org.apache.aurora.scheduler.storage.log.LogStorage$7.apply(LogStorage.java:562)
              at org.apache.aurora.scheduler.storage.log.LogStorage$7.apply(LogStorage.java:559)
              at org.apache.aurora.scheduler.storage.mem.MemStorage.doWork(MemStorage.java:175)
              at org.apache.aurora.scheduler.storage.mem.MemStorage.access$400(MemStorage.java:59)
              at org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:200)
              at org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:197)
              at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:150)
              at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
              at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:197)
              at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:559)
              at org.apache.aurora.scheduler.storage.log.LogStorage$2.execute(LogStorage.java:331)
              at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132)
              at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149)
              at org.apache.aurora.scheduler.storage.mem.MemStorage.doWork(MemStorage.java:175)
              at org.apache.aurora.scheduler.storage.mem.MemStorage.access$400(MemStorage.java:59)
              at org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:200)
              at org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:197)
              at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:150)
              at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
              at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:197)
              at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:548)
              at org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:320)
              at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:92)
              at org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:248)
              at org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:244)
              at com.twitter.common.base.Closures$4.execute(Closures.java:120)
              at org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:423)
              at org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:419)
              at com.twitter.common.base.Closures$4.execute(Closures.java:120)
              at com.twitter.common.base.Closures$3.execute(Closures.java:98)
              at com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
              at org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:465)
              at com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168)
              at com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
              at com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
              at com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
              at com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
              at com.twitter.common.zookeeper.Group$4.get(Group.java:551)
              at com.twitter.common.zookeeper.Group$4.get(Group.java:548)
              at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
              at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
              at com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
              at com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
              at com.twitter.common.zookeeper.Group.watch(Group.java:548)
              at com.twitter.common.zookeeper.CandidateImpl.offerLeadership(CandidateImpl.java:136)
              at com.twitter.common.zookeeper.SingletonService.lead(SingletonService.java:165)
              at org.apache.aurora.scheduler.app.SchedulerMain.run(SchedulerMain.java:253)
              at com.twitter.common.application.AppLauncher.run(AppLauncher.java:102)
              at com.twitter.common.application.AppLauncher.launch(AppLauncher.java:181)
              at com.twitter.common.application.AppLauncher.launch(AppLauncher.java:142)
              at org.apache.aurora.scheduler.app.SchedulerMain.main(SchedulerMain.java:269)
      
      

      Attachments

        Issue Links

          Activity

            People

              wfarner Bill Farner
              bhuvan Bhuvaneswaran A
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: