Description
Since this weekend, we are unable to upgrade scheduler. It fail as it couldn't restore replicated log from snapshot. See following traceback for details.
I couldn't replicate it in vagrant that have >2 jobs. It happen in one of our test clusters, running with 1 scheduler. I could reproduce this problem, with restoring the snapshot manually using aurora_admin scheduler_stage_recovery command. The backup size is 1.5M, if it matter. I can share the snapshot if it'll help to debug.
Scheduler works as of this commit: https://github.com/apache/incubator-aurora/commit/f2d0ca299ad759e9720a5a82547f7bce8a427d53
Scheduler DOES NOT work as of this commit: https://github.com/apache/incubator-aurora/commit/3778330cd37bc8b96907654b496c15152dfecea7
Commit delta: https://github.com/apache/incubator-aurora/compare/f2d0ca299ad759e9720a5a82547f7bce8a427d53...3778330cd37bc8b96907654b496c15152dfecea7
I suspect one of commits related to AURORA-84 have caused this regression. The StorageBackfill class was modified to fix that jira.
I1025 16:07:31.336 THREAD1 org.apache.aurora.scheduler.storage.log.LogStorage.replay: Applying snapshot taken on Sat Oct 25 16:05:29 UTC 2014 I1025 16:07:31.339 THREAD1 org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl$9.execute: Restoring snapshot. I1025 16:07:31.891233 37970 leveldb.cpp:438] Reading position from leveldb took 254951ns I1025 16:07:31.893616 37976 leveldb.cpp:438] Reading position from leveldb took 110726ns I1025 16:07:31.909817 37971 leveldb.cpp:438] Reading position from leveldb took 186434ns I1025 16:07:31.913910 37971 leveldb.cpp:438] Reading position from leveldb took 81185ns I1025 16:07:31.917501 37983 leveldb.cpp:438] Reading position from leveldb took 31971ns I1025 16:07:31.919 THREAD1 org.apache.aurora.scheduler.storage.StorageBackfill.backfill: Backfilling task config job keys. I1025 16:07:32.023 THREAD1 org.apache.aurora.scheduler.storage.StorageBackfill.backfill: Performing shard uniqueness sanity check. E1025 16:07:32.076 THREAD1 org.apache.aurora.scheduler.SchedulerLifecycle$9.execute: Caught unchecked exception: java.lang.IllegalArgumentException java.lang.IllegalArgumentException at com.google.common.base.Preconditions.checkArgument(Preconditions.java:111) at org.apache.aurora.scheduler.base.JobKeys.assertValid(JobKeys.java:100) at org.apache.aurora.scheduler.base.Query$Builder.byInstances(Query.java:337) at org.apache.aurora.scheduler.base.Query.instanceScoped(Query.java:102) at org.apache.aurora.scheduler.storage.StorageBackfill.guaranteeShardUniqueness(StorageBackfill.java:74) at org.apache.aurora.scheduler.storage.StorageBackfill.access$100(StorageBackfill.java:44) at org.apache.aurora.scheduler.storage.StorageBackfill$2.apply(StorageBackfill.java:158) at org.apache.aurora.scheduler.storage.StorageBackfill$2.apply(StorageBackfill.java:151) at org.apache.aurora.scheduler.storage.mem.MemTaskStore.mutateTasks(MemTaskStore.java:194) at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87) at org.apache.aurora.scheduler.storage.log.WriteAheadStorage.mutateTasks(WriteAheadStorage.java:196) at org.apache.aurora.scheduler.storage.StorageBackfill.backfill(StorageBackfill.java:151) at org.apache.aurora.scheduler.SchedulerLifecycle$6$1.execute(SchedulerLifecycle.java:251) at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132) at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149) at org.apache.aurora.scheduler.storage.log.LogStorage$7.apply(LogStorage.java:562) at org.apache.aurora.scheduler.storage.log.LogStorage$7.apply(LogStorage.java:559) at org.apache.aurora.scheduler.storage.mem.MemStorage.doWork(MemStorage.java:175) at org.apache.aurora.scheduler.storage.mem.MemStorage.access$400(MemStorage.java:59) at org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:200) at org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:197) at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:150) at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101) at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:197) at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:559) at org.apache.aurora.scheduler.storage.log.LogStorage$2.execute(LogStorage.java:331) at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132) at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149) at org.apache.aurora.scheduler.storage.mem.MemStorage.doWork(MemStorage.java:175) at org.apache.aurora.scheduler.storage.mem.MemStorage.access$400(MemStorage.java:59) at org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:200) at org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:197) at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:150) at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101) at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:197) at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:548) at org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:320) at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:92) at org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:248) at org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:244) at com.twitter.common.base.Closures$4.execute(Closures.java:120) at org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:423) at org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:419) at com.twitter.common.base.Closures$4.execute(Closures.java:120) at com.twitter.common.base.Closures$3.execute(Closures.java:98) at com.twitter.common.util.StateMachine.transition(StateMachine.java:191) at org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:465) at com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168) at com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155) at com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665) at com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638) at com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579) at com.twitter.common.zookeeper.Group$4.get(Group.java:551) at com.twitter.common.zookeeper.Group$4.get(Group.java:548) at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109) at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107) at com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127) at com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107) at com.twitter.common.zookeeper.Group.watch(Group.java:548) at com.twitter.common.zookeeper.CandidateImpl.offerLeadership(CandidateImpl.java:136) at com.twitter.common.zookeeper.SingletonService.lead(SingletonService.java:165) at org.apache.aurora.scheduler.app.SchedulerMain.run(SchedulerMain.java:253) at com.twitter.common.application.AppLauncher.run(AppLauncher.java:102) at com.twitter.common.application.AppLauncher.launch(AppLauncher.java:181) at com.twitter.common.application.AppLauncher.launch(AppLauncher.java:142) at org.apache.aurora.scheduler.app.SchedulerMain.main(SchedulerMain.java:269)
Attachments
Issue Links
- blocks
-
AURORA-711 0.6.0 Release Candidate
- Resolved