Flink / FLINK-9788

ExecutionGraph Inconsistency prevents Job from recovering

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.5.5, 1.6.2, 1.7.0
    • None
    • Environment: Rev: 4a06160, Hadoop 2.8.3

    Description

      Deployment mode: YARN job mode with HA

      After many TaskManagers were killed in succession, the ExecutionGraph ended up in an inconsistent state, which prevented the job from recovering. The following log output, including the stack trace, appeared in the JobManager log several hundred times per second:

      2018-07-08 16:47:18,855 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job General purpose test job (37a794195840700b98feb23e99f7ea24) switched from state RESTARTING to RESTARTING.
      2018-07-08 16:47:18,856 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the job General purpose test job (37a794195840700b98feb23e99f7ea24).
      2018-07-08 16:47:18,857 DEBUG org.apache.flink.runtime.executiongraph.ExecutionGraph        - Resetting execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for new execution.
      2018-07-08 16:47:18,857 WARN  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Failed to restart the job.
      java.lang.IllegalStateException: Cannot reset a vertex that is in non-terminal state CREATED
              at org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
              at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
              at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
              at org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
              at org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      

      The resulting jobmanager log file was 4.7 GB in size. Find attached the first 5000 lines of the log file.
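
The report does not state the root cause, but the log shows why recovery never completes: each restart attempt tries to reset a vertex that is still in the non-terminal state CREATED, the reset throws, and the next attempt meets the same unchanged state. The following is a minimal sketch of that failure pattern only. It is not Flink's implementation: `RestartLoopSketch`, `State`, `Vertex`, and `attemptRestarts` are hypothetical, simplified stand-ins, and only the exception message is taken from the log above.

```java
// Simplified, hypothetical model of the failure pattern; not Flink's actual classes.
class RestartLoopSketch {

    // A small subset of execution states; FINISHED, CANCELED and FAILED are terminal.
    enum State {
        CREATED, RUNNING, FINISHED, CANCELED, FAILED;

        boolean isTerminal() {
            return this == FINISHED || this == CANCELED || this == FAILED;
        }
    }

    static final class Vertex {
        private State state;

        Vertex(State state) {
            this.state = state;
        }

        // Only a vertex whose previous execution has ended may be reset.
        void resetForNewExecution() {
            if (!state.isTerminal()) {
                throw new IllegalStateException(
                        "Cannot reset a vertex that is in non-terminal state " + state);
            }
            state = State.CREATED;
        }
    }

    // Retries the reset up to maxAttempts times and returns how many attempts failed.
    static int attemptRestarts(Vertex vertex, int maxAttempts) {
        int failures = 0;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                vertex.resetForNewExecution();
                break; // reset succeeded, no further attempts needed
            } catch (IllegalStateException e) {
                // Nothing changes the vertex state between attempts,
                // so every following attempt fails in the same way.
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // A vertex stuck in CREATED fails every attempt.
        System.out.println(attemptRestarts(new Vertex(State.CREATED), 5)); // prints 5
        // A vertex in a terminal state is reset on the first attempt.
        System.out.println(attemptRestarts(new Vertex(State.FAILED), 5));  // prints 0
    }
}
```

In the sketch the attempts are bounded by `maxAttempts`. The log above suggests that in the reported case the restart was retried without an effective bound or delay, which would explain the 4.7 GB log file.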

      Attachments

        1. jobmanager_5000.log
          1.03 MB
          Gary Yao

          People

            Assignee: Till Rohrmann (trohrmann)
            Reporter: Gary Yao (gjy)
            Votes: 0
            Watchers: 4