Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Description
If SchedulerLifecycle encounters a RuntimeException while initializing storage, it takes no action to abort. The result is a scheduler that remains the elected leader in ZooKeeper but never makes progress, and recovery requires human intervention (killing the process).
While fixing this, it would be prudent to consider a broader improvement, such as initiating a shutdown on any uncaught exception during a state transition in SchedulerLifecycle.
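The broader improvement suggested above could look roughly like the following. This is only a minimal sketch of the idea, not the actual fix: `GuardedTransitionSketch`, `ShutdownHook`, and `guarded` are hypothetical names invented for illustration and are not part of Aurora or Twitter commons. The sketch wraps a transition callback so that any RuntimeException first requests a shutdown and is then rethrown, so the caller can still log it.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class GuardedTransitionSketch {

  /** Hypothetical stand-in for whatever initiates scheduler shutdown. */
  interface ShutdownHook {
    void shutdown(Throwable cause);
  }

  /**
   * Wraps a transition callback so that any RuntimeException requests a
   * shutdown before propagating, instead of leaving a stuck leader.
   */
  static Runnable guarded(Runnable transition, ShutdownHook hook) {
    return () -> {
      try {
        transition.run();
      } catch (RuntimeException e) {
        hook.shutdown(e);
        throw e; // Still surface the failure to the caller for logging.
      }
    };
  }

  public static void main(String[] args) {
    AtomicBoolean shutdownRequested = new AtomicBoolean(false);

    // Simulates storage initialization failing while taking leadership.
    Runnable onLeading = guarded(
        () -> {
          throw new IllegalStateException("Problem reading from log");
        },
        cause -> shutdownRequested.set(true));

    try {
      onLeading.run();
    } catch (RuntimeException e) {
      // In the reported trace, the ZooKeeper event thread logs the error here.
    }

    System.out.println("shutdown requested: " + shutdownRequested.get());
  }
}
```

In a real scheduler the hook would presumably release leadership and exit the process, so that another candidate can be elected rather than waiting for an operator to kill the stuck leader.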
E0117 09:04:17.426 THREAD21 org.apache.zookeeper.ClientCnxn$EventThread.processEvent: Error while calling watcher
org.apache.aurora.scheduler.storage.log.LogStorage$RecoveryFailedException: org.apache.aurora.scheduler.log.Log$Stream$StreamAccessException: Problem reading from log
    at org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:329)
    at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
    at org.apache.aurora.scheduler.storage.log.LogStorage$2.execute(LogStorage.java:303)
    at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:138)
    at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:155)
    at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:146)
    at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
    at org.apache.aurora.scheduler.storage.ForwardingStore.write(ForwardingStore.java:105)
    at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:475)
    at org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:298)
    at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:94)
    at org.apache.aurora.scheduler.SchedulerLifecycle$5.execute(SchedulerLifecycle.java:240)
    at org.apache.aurora.scheduler.SchedulerLifecycle$5.execute(SchedulerLifecycle.java:237)
    at com.twitter.common.base.Closures$4.execute(Closures.java:120)
    at com.twitter.common.base.Closures$4.execute(Closures.java:120)
    at com.twitter.common.base.Closures$3.execute(Closures.java:98)
    at com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
    at org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:446)
    at com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168)
    at com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
    at com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
    at com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
    at com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
    at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:600)
    at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:597)
    at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
    at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
    at com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
    at com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
    at com.twitter.common.zookeeper.Group$GroupMonitor.tryWatchGroup(Group.java:622)
    at com.twitter.common.zookeeper.Group$GroupMonitor.access$1100(Group.java:579)
    at com.twitter.common.zookeeper.Group$GroupMonitor$1.process(Group.java:591)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
Caused by: org.apache.aurora.scheduler.log.Log$Stream$StreamAccessException: Problem reading from log
    at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream$2.hasNext(MesosLog.java:255)
    at org.apache.aurora.scheduler.storage.log.LogManager$StreamManager.readFromBeginning(LogManager.java:190)
    at org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:323)
    ... 33 more
Caused by: org.apache.mesos.Log$OperationFailedException: Bad read range (includes pending entries)
    at org.apache.mesos.Log$Reader.read(Native Method)
    at org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule$4.read(MesosLogStreamModule.java:168)
    at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream$2.hasNext(MesosLog.java:233)
    ... 35 more