Flink / FLINK-4046

Failing a restarting job can get stuck in JobStatus.FAILING


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.1.0
    • Component/s: Runtime / Coordination
    • Labels: None

    Description

      When a job is in state RESTARTING, all of its ExecutionJobVertices may already be in a final state (if they have not been reset yet). Calling fail on this ExecutionGraph transitions the job to FAILING and calls cancel on all ExecutionJobVertices. The job can only leave FAILING once all ExecutionJobVertices have reached a final state, and the ExecutionGraph is only notified of this when all subtasks of an ExecutionJobVertex transition to a final state. That notification never arrives, because the ExecutionJobVertices are already in a final state and no further transition takes place. As a result, a job can get stuck in FAILING if fail is called while it is RESTARTING.

      I propose adding a direct transition from RESTARTING to FAILED, as is already the case for the cancel call (which transitions directly from RESTARTING to CANCELED).
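
To illustrate the proposal, here is a minimal sketch of the state handling. It is not Flink's actual code: `MiniExecutionGraph`, `allVerticesInFinalState()` and `getCancelRequests()` are hypothetical names made up for this example, and the `JobStatus` enum is reduced to the states mentioned in this issue. The sketch assumes that, while RESTARTING, every vertex is already in a final state, so no final-state notification will follow.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Reduced set of job states for illustration; not Flink's full enum. */
enum JobStatus { RUNNING, RESTARTING, FAILING, FAILED, CANCELED }

/** Hypothetical stand-in for ExecutionGraph, limited to state handling. */
final class MiniExecutionGraph {
    private final AtomicReference<JobStatus> state;
    // Counts how often cancel was requested on the (imaginary) vertices.
    private final AtomicInteger cancelRequests = new AtomicInteger();

    MiniExecutionGraph(JobStatus initial) {
        this.state = new AtomicReference<>(initial);
    }

    JobStatus getState() { return state.get(); }

    int getCancelRequests() { return cancelRequests.get(); }

    /** Fails the job, using the proposed direct RESTARTING -> FAILED path. */
    void fail() {
        while (true) {
            JobStatus current = state.get();
            if (current == JobStatus.FAILING
                    || current == JobStatus.FAILED
                    || current == JobStatus.CANCELED) {
                return; // already failing or in a terminal state
            }
            if (current == JobStatus.RESTARTING) {
                // Assumption: vertices are already final, so nobody will
                // notify us later. Skip FAILING and go straight to FAILED.
                if (state.compareAndSet(current, JobStatus.FAILED)) {
                    return;
                }
            } else if (state.compareAndSet(current, JobStatus.FAILING)) {
                // Regular path: cancel vertices and wait for notification.
                cancelRequests.incrementAndGet();
                return;
            }
            // A concurrent state change won the race; re-read and retry.
        }
    }

    /** Called once all vertices have reached a final state. */
    void allVerticesInFinalState() {
        state.compareAndSet(JobStatus.FAILING, JobStatus.FAILED);
    }
}
```

With this sketch, calling `fail()` on a graph constructed in RESTARTING ends in FAILED immediately, while a graph in RUNNING moves to FAILING and only reaches FAILED after `allVerticesInFinalState()` is called.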

            People

              Assignee: trohrmann Till Rohrmann
              Reporter: trohrmann Till Rohrmann
              Votes: 0
              Watchers: 2
