Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Labels:
      None

      Description

      This is part of the non-disruptive JobManager failure recovery.

      I suggest to add a JobStatus and ExecutionState RECONCILING.
      If a job is started on a that JobManager for master recovery (tbd how to determine that) the ExecutionGraph and the {{Execution}}s start in the reconciling state.

      From RECONCILING, tasks can go to RUNNING (execution reconciled with TaskManager) or to FAILED.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user wangzhijiang999 opened a pull request:

          https://github.com/apache/flink/pull/3113

          FLINK-4912 Introduce RECONCILIATING state in ExecutionGraph and Exe…

          This is part of the non-disruptive JobManager failure recovery.

          Add a JobStatus and ExecutionState RECONCILING.
          If a job is started on a JobManager for master recovery, the job status with all the executions transition to RECONCILING state.

          From RECONCILING, execution can go to any existing task states (execution reconciled with TaskManager).

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/wangzhijiang999/flink FLINK-4912

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3113.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3113


          commit 0fbd628b9b8817fd1b71faca92d87c56213d79f6
          Author: 淘江 <taojiang.wzj@alibaba-inc.com>
          Date: 2017-01-13T08:41:37Z

          FLINK-4912 Introduce RECONCILIATING state in ExecutionGraph and Execution for JobManager failure recovery


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user wangzhijiang999 opened a pull request: https://github.com/apache/flink/pull/3113 FLINK-4912 Introduce RECONCILIATING state in ExecutionGraph and Exe… This is part of the non-disruptive JobManager failure recovery. Add a JobStatus and ExecutionState RECONCILING . If a job is started on a JobManager for master recovery, the job status with all the executions transition to RECONCILING state. From RECONCILING , execution can go to any existing task states (execution reconciled with TaskManager). You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangzhijiang999/flink FLINK-4912 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3113.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3113 commit 0fbd628b9b8817fd1b71faca92d87c56213d79f6 Author: 淘江 <taojiang.wzj@alibaba-inc.com> Date: 2017-01-13T08:41:37Z FLINK-4912 Introduce RECONCILIATING state in ExecutionGraph and Execution for JobManager failure recovery
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3113#discussion_r95960106

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/execution/ExecutionState.java —
          @@ -25,16 +25,23 @@

          • <pre> {@code * * CREATED -> SCHEDULED -> DEPLOYING -> RUNNING -> FINISHED - * | | | - * | | +------+ - * | V V - * | CANCELLING -----+----> CANCELED - * | | - * +-------------------------+ + * | | | | + * | | | +------+ + * | | V V + * | | CANCELLING -----+----> CANCELED + * | | | + * | +-------------------------+ + * | + * | ... -> FAILED + * V + * RECONCILING -> RUNNING | FINISHED | CANCELED | FAILED * - * ... -> FAILED * }

            </pre>
            *
            + * <p>It is possible to enter the

            {@code RECONCILING}

            state from

            {@code CREATED}
              • End diff –

          I think it would be best to move this paragraph below the other paragraphs below (below line 48).

          A general note: I think for Javadocs it's enough to just have the opening `<p>` tag like this:

          ```
          <p>Start here...
          ...continue and not closing tag at the end.
          ```

          Show
          githubbot ASF GitHub Bot added a comment - Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/3113#discussion_r95960106 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/execution/ExecutionState.java — @@ -25,16 +25,23 @@ <pre> {@code * * CREATED -> SCHEDULED -> DEPLOYING -> RUNNING -> FINISHED - * | | | - * | | +------+ - * | V V - * | CANCELLING -----+----> CANCELED - * | | - * +-------------------------+ + * | | | | + * | | | +------+ + * | | V V + * | | CANCELLING -----+----> CANCELED + * | | | + * | +-------------------------+ + * | + * | ... -> FAILED + * V + * RECONCILING -> RUNNING | FINISHED | CANCELED | FAILED * - * ... -> FAILED * } </pre> * + * <p>It is possible to enter the {@code RECONCILING} state from {@code CREATED} End diff – I think it would be best to move this paragraph below the other paragraphs below (below line 48). A general note: I think for Javadocs it's enough to just have the opening `<p>` tag like this: ``` <p>Start here... ...continue and not closing tag at the end. ```
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on the issue:

          https://github.com/apache/flink/pull/3113

          Thanks for the PR. This looks good to me. I'm not too familiar with the FLIP-6 plans though. I would wait for someone who is in on that to merge this.

          Show
          githubbot ASF GitHub Bot added a comment - Github user uce commented on the issue: https://github.com/apache/flink/pull/3113 Thanks for the PR. This looks good to me. I'm not too familiar with the FLIP-6 plans though. I would wait for someone who is in on that to merge this.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3113

          I would like to take a look at this soon...

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3113 I would like to take a look at this soon...
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user wangzhijiang999 commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3113#discussion_r95975088

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/execution/ExecutionState.java —
          @@ -25,16 +25,23 @@

          • <pre> {@code * * CREATED -> SCHEDULED -> DEPLOYING -> RUNNING -> FINISHED - * | | | - * | | +------+ - * | V V - * | CANCELLING -----+----> CANCELED - * | | - * +-------------------------+ + * | | | | + * | | | +------+ + * | | V V + * | | CANCELLING -----+----> CANCELED + * | | | + * | +-------------------------+ + * | + * | ... -> FAILED + * V + * RECONCILING -> RUNNING | FINISHED | CANCELED | FAILED * - * ... -> FAILED * }

            </pre>
            *
            + * <p>It is possible to enter the

            {@code RECONCILING}

            state from

            {@code CREATED}
              • End diff –

          Thank you for suggestions of the format.

          Show
          githubbot ASF GitHub Bot added a comment - Github user wangzhijiang999 commented on a diff in the pull request: https://github.com/apache/flink/pull/3113#discussion_r95975088 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/execution/ExecutionState.java — @@ -25,16 +25,23 @@ <pre> {@code * * CREATED -> SCHEDULED -> DEPLOYING -> RUNNING -> FINISHED - * | | | - * | | +------+ - * | V V - * | CANCELLING -----+----> CANCELED - * | | - * +-------------------------+ + * | | | | + * | | | +------+ + * | | V V + * | | CANCELLING -----+----> CANCELED + * | | | + * | +-------------------------+ + * | + * | ... -> FAILED + * V + * RECONCILING -> RUNNING | FINISHED | CANCELED | FAILED * - * ... -> FAILED * } </pre> * + * <p>It is possible to enter the {@code RECONCILING} state from {@code CREATED} End diff – Thank you for suggestions of the format.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3113

          Will merge this.
          To make it proper robust, I will add some tests that validate the state transitions of the state machine...

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3113 Will merge this. To make it proper robust, I will add some tests that validate the state transitions of the state machine...
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3113

          Considering the possible state transitions:

            1. ExecutionState
          • `RECONCILING` can only be entered from `CREATED`

          Simple:

          • `RECONCILING` can go to `RUNNING` if the task was reconciled
          • `RECONCILING` can go to `FAILED` if the task was not reconciled

          Complex:

          • For `RECONCILING` to go to `FINISHED`, `CANCELED`, it would mean that the TaskManager that has the task would report (when registering at the JobManager) a task that is no longer executing. To do that, the TaskManager would need to "remember" tasks that completed and where it did not get an acknowledgement from the JobManager for the execution state update. Is that anticipated?
            1. JobStatus
          • `RECONCILING` can only be entered from `CREATED`

          Simple:

          • `RECONCILING` can go to `RUNNING` - if all TaskManagers report their status and tasks as running
          • `RECONCILING` can go to `FAILING` - if not all tasks were reported.

          Complex:

          • For reconciling to go to into `FINISHED`, we'd need that the `ExecutionState` can go to `FINISHED`.

          What do you think about only doing the "simple" option in the first version?

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3113 Considering the possible state transitions: ExecutionState `RECONCILING` can only be entered from `CREATED` Simple: `RECONCILING` can go to `RUNNING` if the task was reconciled `RECONCILING` can go to `FAILED` if the task was not reconciled Complex: For `RECONCILING` to go to `FINISHED`, `CANCELED`, it would mean that the TaskManager that has the task would report (when registering at the JobManager) a task that is no longer executing. To do that, the TaskManager would need to "remember" tasks that completed and where it did not get an acknowledgement from the JobManager for the execution state update. Is that anticipated? JobStatus `RECONCILING` can only be entered from `CREATED` Simple: `RECONCILING` can go to `RUNNING` - if all TaskManagers report their status and tasks as running `RECONCILING` can go to `FAILING` - if not all tasks were reported. Complex: For reconciling to go to into `FINISHED`, we'd need that the `ExecutionState` can go to `FINISHED`. What do you think about only doing the "simple" option in the first version?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user wangzhijiang999 commented on the issue:

          https://github.com/apache/flink/pull/3113

          @StephanEwen , thank you for the concrete suggestions. Sorry for delay response because of Chinese Spring Festival Holiday.

          I have considered and added some tests to validate the state transitions of the state machine related with the later processes which would be submitted in the following PRs together.

          I totally agree with the consideration of the above possible state transitions. And I plan to give a detail explanation of my implementation in another jira soon. It is actually a bit complex to do that ,so I try to break them down into small ones in order to review and merge quickly.

          Show
          githubbot ASF GitHub Bot added a comment - Github user wangzhijiang999 commented on the issue: https://github.com/apache/flink/pull/3113 @StephanEwen , thank you for the concrete suggestions. Sorry for delay response because of Chinese Spring Festival Holiday. I have considered and added some tests to validate the state transitions of the state machine related with the later processes which would be submitted in the following PRs together. I totally agree with the consideration of the above possible state transitions. And I plan to give a detail explanation of my implementation in another jira soon. It is actually a bit complex to do that ,so I try to break them down into small ones in order to review and merge quickly.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user wangzhijiang999 commented on the issue:

          https://github.com/apache/flink/pull/3113

          @StephanEwen , I already created #5703 for further detail recovery process and it may cover your considerations. Wish your further response, thank you!

          Show
          githubbot ASF GitHub Bot added a comment - Github user wangzhijiang999 commented on the issue: https://github.com/apache/flink/pull/3113 @StephanEwen , I already created #5703 for further detail recovery process and it may cover your considerations. Wish your further response, thank you!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3113

          Given that this only extends the enum and does not add changes to the state transitions, we can merge this.

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3113 Given that this only extends the enum and does not add changes to the state transitions, we can merge this.
          Hide
          StephanEwen Stephan Ewen added a comment -

          Implemented in cc27f0803f4ed1d9799594c75ac00d0e14447479

          Show
          StephanEwen Stephan Ewen added a comment - Implemented in cc27f0803f4ed1d9799594c75ac00d0e14447479
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3113

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/3113

            People

            • Assignee:
              zjwang zhijiang
              Reporter:
              StephanEwen Stephan Ewen
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development