[FLINK-14316] Stuck in "Job leader ... lost leadership" error - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.7.2
Fix Version/s: 1.9.3, 1.10.1, 1.11.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

This is the first exception that caused the restart loop. Later exceptions are the same. The job seems to be stuck in this permanent failure state.

2019-10-03 21:42:46,159 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: clpevents -> device_filter -> processed_imps -> ios_processed_impression -> imps_ts_assigner (449/1360) (d237f5e99b6a4a580498821473763edb) switched from SCHEDULED to FAILED.
java.lang.Exception: Job leader for job id ecb9ad9be934edf7b1a4f7b9dd6df365 lost leadership.
        at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1526)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
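
To check that later failures really are the same exception, the log can be scanned for the exception line shown above. The following is only a minimal sketch, not part of the original report: the function name `count_lost_leadership` is hypothetical, and the pattern assumes the message keeps the exact wording quoted in this description and that job ids are 32 lowercase hex characters.

```python
import re
from collections import Counter

# Pattern copied from the exception line in the description above.
# Assumption: job ids are 32 lowercase hex characters, as in this report.
LOST_LEADERSHIP = re.compile(
    r"java\.lang\.Exception: Job leader for job id ([0-9a-f]{32}) lost leadership\."
)


def count_lost_leadership(lines):
    """Return a Counter mapping job id -> number of 'lost leadership' exceptions."""
    counts = Counter()
    for line in lines:
        match = LOST_LEADERSHIP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The exception line from this report, used as sample input.
sample = [
    "java.lang.Exception: Job leader for job id "
    "ecb9ad9be934edf7b1a4f7b9dd6df365 lost leadership.",
]
counts = count_lost_leadership(sample)
```

With a real log, passing an open file object (for example `count_lost_leadership(open(path))`, where `path` is a placeholder for the log location) works the same way. A count for one job id that keeps growing across restarts would be consistent with the loop described here.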

Attachments

RpcConnection.patch, 08/Oct/19 10:27, 1 kB, Piyush Goyal
FLINK-14316.tgz, 08/Oct/19 02:03, 8.57 MB, Steven Zhen Wu

Issue Links

is related to

FLINK-16836 Losing leadership does not clear rpc connection in JobManagerLeaderListener

Closed

links to

GitHub Pull Request #11603

People

Assignee: Till Rohrmann
Reporter: Steven Zhen Wu
Votes: 0
Watchers: 6

Dates

Created: 04/Oct/19 00:58
Updated: 17/Apr/20 05:22
Resolved: 02/Apr/20 12:01

Time Tracking

Estimated: Not Specified
Remaining:
Logged:
