Details
- Bug
- Status: Resolved
- Critical
- Resolution: Fixed
- 1.8.2, 1.9.0, 1.10.0
Description
A connection issue with ZooKeeper caused the job to restart. Shutdown then failed with the fatal NPE shown below, which appears to have caused the JVM to exit:
2019-10-02 16:16:19,134 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x16d83374c4206f8, likely server has closed socket, closing socket connection and attempting reconnect
2019-10-02 16:16:19,234 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2019-10-02 16:16:19,235 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2019-10-02 16:16:19,235 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2019-10-02 16:16:19,235 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@100.122.177.82:42043/user/dispatcher no longer participates in the leader election.
2019-10-02 16:16:19,237 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - http://100.122.177.82:8081 lost leadership
2019-10-02 16:16:19,237 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager - ResourceManager akka.tcp://flink@100.122.177.82:42043/user/resourcemanager was revoked leadership. Clearing fencing token.
2019-10-02 16:16:19,237 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/e4e68f2b3fc40c7008cca624b2a2bab0/job_manager_lock.
2019-10-02 16:16:19,237 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
2019-10-02 16:16:19,238 INFO org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManager for job ksrouter (e4e68f2b3fc40c7008cca624b2a2bab0) was revoked leadership at akka.tcp://flink@100.122.177.82:42043/user/jobmanager_0.
2019-10-02 16:16:19,239 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-10-02 16:16:19,239 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender http://100.122.177.82:8081 no longer participates in the leader election.
2019-10-02 16:16:19,239 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2019-10-02 16:16:19,239 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@100.122.177.82:42043/user/jobmanager_0 no longer participates in the leader election.
2019-10-02 16:16:19,239 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2019-10-02 16:16:19,239 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job ksrouter (e4e68f2b3fc40c7008cca624b2a2bab0) switched from state RUNNING to SUSPENDED.
org.apache.flink.util.FlinkException: JobManager is no longer the leader.
    at org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeJobMasterLeadership(JobManagerRunner.java:391)
    at org.apache.flink.runtime.jobmaster.JobManagerRunner.lambda$revokeLeadership$5(JobManagerRunner.java:377)
    at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:981)
    at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2124)
    at org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:374)
    at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
    at org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
    at org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2019-10-02 16:16:19,240 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@100.122.177.82:42043/user/resourcemanager no longer participates in the leader election.
2019-10-02 16:16:19,239 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@100.122.177.82:42043/user/dispatcher was revoked leadership.
2019-10-02 16:16:19,239 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl - Suspending the SlotManager.
2019-10-02 16:16:19,240 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Stopping all currently running jobs of dispatcher akka.tcp://flink@100.122.177.82:42043/user/dispatcher.
2019-10-02 16:16:19,258 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job e4e68f2b3fc40c7008cca624b2a2bab0.
2019-10-02 16:16:19,258 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Suspending
2019-10-02 16:16:20,076 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-6650646464026425406.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2019-10-02 16:16:20,076 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 100.66.21.125/100.66.21.125:2181
2019-10-02 16:16:20,076 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
2019-10-02 16:16:20,077 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 100.66.21.125/100.66.21.125:2181, initiating session
2019-10-02 16:16:20,080 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server 100.66.21.125/100.66.21.125:2181, sessionid = 0x16d83374c4206f8, negotiated timeout = 40000
2019-10-02 16:16:20,080 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
2019-10-02 16:16:20,080 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-10-02 16:16:20,082 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.
2019-10-02 16:16:20,082 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-10-02 16:16:20,082 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-10-02 16:16:20,082 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-10-02 16:16:20,082 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-10-02 16:16:20,082 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-10-02 16:16:20,322 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager - ResourceManager akka.tcp://flink@100.122.177.82:42043/user/resourcemanager was granted leadership with fencing token 94628b472f22083fb4e611d108304613
2019-10-02 16:16:20,322 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl - Starting the SlotManager.
2019-10-02 16:16:20,326 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - http://100.122.177.82:8081 was granted leadership with leaderSessionID=69531adf-a5eb-46d2-99dc-b850f66e1af1
2019-10-02 16:16:20,393 INFO org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManagerRunner already shutdown.
2019-10-02 16:16:20,393 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@100.122.177.82:42043/user/dispatcher was granted leadership with fencing token abe4ffde-0a18-412b-bffa-28067ccccbeb
2019-10-02 16:16:20,393 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs.
2019-10-02 16:16:20,407 INFO com.facebook.presto.s3fs.PrestoS3FileSystem - Opening path: s3://us-east-1.ksrouter.dev/recovery/4fa8-1569989613304/444/submittedJobGraphf4c0027a08cf
2019-10-02 16:16:20,407 INFO com.facebook.presto.s3fs.PrestoS3FileSystem - Seek with new stream for s3://us-east-1.ksrouter.dev/recovery/4fa8-1569989613304/444/submittedJobGraphf4c0027a08cf to offset 0
2019-10-02 16:16:20,412 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager - Registering TaskManager with ResourceID 015fd9b87fc4ceb7bb1c40318db0d854 (akka.tcp://flink@100.122.134.166:43709/user/taskmanager_0) at ResourceManager
2019-10-02 16:16:20,412 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceReconciler - Task manager 100.122.134.166 re-registered.
2019-10-02 16:16:20,416 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
2019-10-02 16:16:20,416 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job e4e68f2b3fc40c7008cca624b2a2bab0 has been suspended.
2019-10-02 16:16:20,417 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Suspending SlotPool.
2019-10-02 16:16:20,417 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection dda8a9f54ec0239b7e2aa53e0a9b6174: JobManager is no longer the leader..
2019-10-02 16:16:20,418 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job ksrouter(e4e68f2b3fc40c7008cca624b2a2bab0).
2019-10-02 16:16:20,420 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/e4e68f2b3fc40c7008cca624b2a2bab0/job_manager_lock'}.
2019-10-02 16:16:20,431 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager - Registering TaskManager with ResourceID 8d41d9e915b5fe378b3d7cd18042fcff (akka.tcp://flink@100.122.50.217:44327/user/taskmanager_0) at ResourceManager
2019-10-02 16:16:20,431 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceReconciler - Task manager 100.122.50.217 re-registered.
2019-10-02 16:16:20,542 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Recovered SubmittedJobGraph(e4e68f2b3fc40c7008cca624b2a2bab0).
2019-10-02 16:16:20,542 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Recovered SubmittedJobGraph(e4e68f2b3fc40c7008cca624b2a2bab0).
2019-10-02 16:16:20,542 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id abe4ffde-0a18-412b-bffa-28067ccccbeb.
    at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:88)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.dispatcher.DispatcherException: Termination of previous JobManager for job e4e68f2b3fc40c7008cca624b2a2bab0 failed. Cannot submit job under the same job id.
    at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$waitForTerminatingJobManager$33(Dispatcher.java:949)
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
    at java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:884)
    at java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2196)
    at org.apache.flink.runtime.dispatcher.Dispatcher.waitForTerminatingJobManager(Dispatcher.java:946)
    at org.apache.flink.runtime.dispatcher.Dispatcher.tryAcceptLeadershipAndRunJobs(Dispatcher.java:933)
    at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$28(Dispatcher.java:892)
    at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
    ... 23 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.util.FlinkException: Could not properly shut down the JobManagerRunner
    at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
    at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
    at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
    at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.jobmaster.JobManagerRunner.lambda$closeAsync$0(JobManagerRunner.java:207)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.postStop(AkkaRpcActor.java:132)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.postStop(FencedAkkaRpcActor.java:40)
    at akka.actor.Actor$class.aroundPostStop(Actor.scala:536)
    at akka.actor.AbstractActor.aroundPostStop(AbstractActor.scala:225)
    at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
    at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
    at akka.actor.ActorCell.terminate(ActorCell.scala:429)
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:533)
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:549)
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:283)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:261)
    ... 6 more
Caused by: org.apache.flink.util.FlinkException: Could not properly shut down the JobManagerRunner
    ... 22 more
Caused by: org.apache.flink.runtime.rpc.akka.exceptions.AkkaRpcException: Failure while stopping RpcEndpoint jobmanager_0.
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:513)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:175)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    ... 6 more
Caused by: java.lang.NullPointerException
    at org.apache.flink.runtime.jobmaster.JobMaster.disconnectTaskManager(JobMaster.java:425)
    at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:343)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:509)
    ... 18 more
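The log shows the job being suspended after leadership was revoked, and the NPE being raised later from JobMaster.onStop via JobMaster.disconnectTaskManager. One general pattern that can produce this ordering problem is a suspend step that releases a field which the stop step still dereferences. The sketch below illustrates that pattern only. It is not Flink's code and not a diagnosis of this bug; the class, field, and method names (ShutdownSketch, registeredWorkers, stopUnguarded, stopGuarded) are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for an endpoint whose suspend() releases state
 * that a later stop() still uses. None of these names come from Flink.
 */
class ShutdownSketch {
    // Populated while running, released on suspend (assumed behavior).
    private Map<String, String> registeredWorkers = new HashMap<>();

    void register(String id, String address) {
        registeredWorkers.put(id, address);
    }

    /** Simulates losing leadership: the state is dropped entirely. */
    void suspend() {
        registeredWorkers = null;
    }

    /** Unguarded stop: throws NullPointerException if suspend() ran first. */
    int stopUnguarded() {
        int disconnected = registeredWorkers.size();
        registeredWorkers.clear();
        return disconnected;
    }

    /** Guarded stop: already-released state means nothing is left to disconnect. */
    int stopGuarded() {
        if (registeredWorkers == null) {
            return 0;
        }
        int disconnected = registeredWorkers.size();
        registeredWorkers.clear();
        return disconnected;
    }

    public static void main(String[] args) {
        ShutdownSketch endpoint = new ShutdownSketch();
        endpoint.register("worker-1", "host-a:1234");
        endpoint.suspend();
        try {
            endpoint.stopUnguarded();
        } catch (NullPointerException e) {
            System.out.println("unguarded stop failed after suspend");
        }
        System.out.println("guarded stop disconnected " + endpoint.stopGuarded());
    }
}
```

In this sketch the guarded variant lets shutdown complete after a suspend, so a caller waiting on termination would not see the failure propagate. Whether a null check, a no-op placeholder object, or a reordering of suspend and stop is the appropriate remedy in the real code depends on details that are not visible in this report.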