[FLINK-9788] ExecutionGraph Inconsistency prevents Job from recovering - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.6.0
Fix Version/s: 1.5.5, 1.6.2, 1.7.0
Component/s: None
Labels:
- pull-request-available
Environment:

Rev: 4a06160
Hadoop 2.8.3

Description

Deployment mode: YARN job mode with HA

After killing many TaskManagers in succession, the state of the ExecutionGraph ran into an inconsistent state, which prevented job recovery. The following stacktrace was logged in the JobManager log several hundred times per second:

-08 16:47:18,855 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job General purpose test job (37a794195840700b98feb23e99f7ea24) switched from state RESTARTING to RESTARTING.
2018-07-08 16:47:18,856 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the job General purpose test job (37a794195840700b98feb23e99f7ea24).
2018-07-08 16:47:18,857 DEBUG org.apache.flink.runtime.executiongraph.ExecutionGraph        - Resetting execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for new execution.
2018-07-08 16:47:18,857 WARN  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Failed to restart the job.
java.lang.IllegalStateException: Cannot reset a vertex that is in non-terminal state CREATED
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
        at org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
        at org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

The resulting jobmanager log file was 4.7 GB in size. Find attached the first 5000 lines of the log file.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

jobmanager_5000.log
10/Jul/18 10:20
1.03 MB
Gary Yao

Issue Links

links to

GitHub Pull Request #6810

GitHub Pull Request #6811

Activity

People

Assignee:: Till Rohrmann

Reporter:: Gary Yao

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 10/Jul/18 10:27

Updated:: 28/Feb/19 13:20

Resolved:: 11/Oct/18 15:01