FLINK-7231

SlotSharingGroups are not always released in time for new restarts

    Details

      Description

      If there are not enough resources to schedule the streaming program, a race condition can cause the following error to occur repeatedly:

      java.lang.IllegalStateException: SlotSharingGroup cannot clear task assignment, group still has allocated resources.
      

      This eventually recovers, but may involve many fast restart attempts before doing so.

      The root cause is that slots are not cleared before the next restart attempt.
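
      The failure mode can be illustrated with a toy model. This is only a sketch: `ToySharingGroup` and its methods are hypothetical stand-ins, not Flink's actual `SlotSharingGroup` implementation. It assumes a restart begins by clearing the group's task assignment, which is refused while any slot is still allocated.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a slot sharing group; not Flink's real class. */
class ToySharingGroup {
    private final Set<String> allocatedSlots = new HashSet<>();

    void allocate(String slotId) {
        allocatedSlots.add(slotId);
    }

    void release(String slotId) {
        allocatedSlots.remove(slotId);
    }

    /** Assumed to be called at the start of every restart attempt. */
    void clearTaskAssignment() {
        if (!allocatedSlots.isEmpty()) {
            throw new IllegalStateException(
                "SlotSharingGroup cannot clear task assignment, group still has allocated resources.");
        }
    }

    public static void main(String[] args) {
        ToySharingGroup group = new ToySharingGroup();
        group.allocate("slot-1");

        try {
            // The restart overtakes the slot release: the group still holds slot-1.
            group.clearTaskAssignment();
        } catch (IllegalStateException e) {
            System.out.println("restart attempt failed: " + e.getMessage());
        }

        // Once the slot has been released, the same call succeeds.
        group.release("slot-1");
        group.clearTaskAssignment();
        System.out.println("restart attempt can proceed");
    }
}
```

      In this model, a restart strategy with a very short delay would keep retrying the failing call until the release finally lands, which matches the "many fast restart attempts" described above.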

          Activity

          aljoscha Aljoscha Krettek added a comment -

          Fixed in

          • 1.4.0 via 605319b550aeba5612b0e32fa193521081b7adc5
          • 1.3.2 via 39f5b1144167dcb80e8708f4cb5426e76f648026
          aljoscha Aljoscha Krettek added a comment -

          Reopen to fix release note.

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen closed the pull request at:

          https://github.com/apache/flink/pull/4370

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/4370

          Thanks for the review!
          Merging this...

          The commit by Niko was probably because github was slightly out of sync with the apache git repo and thought that commit was part of the diff...

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on a diff in the pull request:

          https://github.com/apache/flink/pull/4370#discussion_r128514948

          --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphRestartTest.java ---
          @@ -661,10 +682,117 @@ public void testConcurrentGlobalFailAndRestarts() throws Exception {
          }
          }

          + @Test
          + public void testRestartWithEagerSchedulingAndSlotSharing() throws Exception {
          --- End diff --

          Exactly, it extends test coverage to possible related conditions.
          I realized that we had no test about the "happy path" of eager scheduling and slot sharing as well.

          githubbot ASF GitHub Bot added a comment -

          Github user aljoscha commented on a diff in the pull request:

          https://github.com/apache/flink/pull/4370#discussion_r128497552

          --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphRestartTest.java ---
          @@ -661,10 +682,117 @@ public void testConcurrentGlobalFailAndRestarts() throws Exception {
          }
          }

          + @Test
          + public void testRestartWithEagerSchedulingAndSlotSharing() throws Exception {
          --- End diff --

          Does this test just further verify existing behaviour? I'm asking because it doesn't fail when changing the order of slot release and restart back to the original order.

          aljoscha Aljoscha Krettek added a comment -

          Stephan Ewen, is this also a blocker? Asking because fixVersion is 1.3.2.

          githubbot ASF GitHub Bot added a comment -

          GitHub user StephanEwen opened a pull request:

          https://github.com/apache/flink/pull/4370

          FLINK-7231 [distr. coordination] Fix slot release affecting SlotSharingGroup cleanup

          *This is based on #4364, so only the last commit is relevant*

          What is the purpose of the change

          This fixes FLINK-7231 (https://issues.apache.org/jira/browse/FLINK-7231), a bug that makes restarts unstable in the presence of certain combinations of slot sharing, TaskManager losses, and restart strategies.

          Brief change log

          • Minimal adjustment in `ExecutionGraph`: On failed resource acquisition, release slots (and with that sharing group assignments) before triggering the recovery. Before this change, both happened concurrently/asynchronously (and recovery may have overtaken slot release).
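
          The ordering described in the change log can be sketched with plain `CompletableFuture` chaining. This is an illustration only and not the actual `ExecutionGraph` code: `OrderedRecoverySketch`, `releaseThenRestart`, and the two flags are hypothetical names. The point is that the restart is attached as a continuation of the slot release, so it cannot overtake it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: chain the restart after the slot release. */
class OrderedRecoverySketch {

    /** Returns true if the restart step observed that the slots were already released. */
    static boolean releaseThenRestart() {
        AtomicBoolean slotsHeld = new AtomicBoolean(true);
        AtomicBoolean restartSawReleasedSlots = new AtomicBoolean(false);

        // Step 1: release the slots (and with that the sharing group assignments).
        CompletableFuture<Void> released =
            CompletableFuture.runAsync(() -> slotsHeld.set(false));

        // Step 2: trigger recovery only after the release has completed.
        CompletableFuture<Void> restarted =
            released.thenRun(() -> restartSawReleasedSlots.set(!slotsHeld.get()));

        restarted.join();
        return restartSawReleasedSlots.get();
    }

    public static void main(String[] args) {
        System.out.println("restart saw released slots: " + releaseThenRestart());
    }
}
```

          Had both steps been started independently with `runAsync`, nothing would prevent the restart from running first, which is the situation the change log describes as "recovery may have overtaken slot release".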

          Verifying this change

          This change adds additional unit tests:

          • `ExecutionGraphRestartTest#testRestartWithEagerSchedulingAndSlotSharing()`
          • `ExecutionGraphRestartTest#testRestartWithSlotSharingAndNotEnoughResources()`

          The effect (and fix) can also be observed by repeatedly trying the following:

          • Create a streaming job with multiple JobVertices
          • Set the restart strategy to fixed-delay with zero delay
          • Run the job
          • Repeat: Kill a TaskManager and bring up a recovery TaskManager. There is a good chance that several restart attempts fail with `java.lang.IllegalStateException: SlotSharingGroup cannot clear task assignment, group still has allocated resources.`, which means the job takes a long time to actually recover.

          Does this pull request potentially affect one of the following parts:

          • Dependencies (does it add or upgrade a dependency): *no*
          • The public API, i.e., is any changed class annotated with `@Public(Evolving)`: *no*
          • The serializers: *no*
          • The runtime per-record code paths (performance sensitive): *no*
          • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: *yes*

          Documentation

          • Does this pull request introduce a new feature? *no*
          • If yes, how is the feature documented? *not applicable*

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/StephanEwen/incubator-flink sharing_group_bug

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/4370.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #4370


          commit f055645b3d905ea212b11eb570926d46447f3f52
          Author: zjureel <zjureel@gmail.com>
          Date: 2017-07-18T17:27:56Z

          FLINK-6665 FLINK-6667 [distributed coordination] Use a callback and a ScheduledExecutor for ExecutionGraph restarts

          Initial work by zjureel@gmail.com , improved by sewen@apache.org.

          commit 11e2144892a57c58ffe919ac228c702595f34025
          Author: Stephan Ewen <sewen@apache.org>
          Date: 2017-07-18T17:49:56Z

          FLINK-7216 [distr. coordination] Guard against concurrent global failover

          commit 16e9e133e0ed9dfba2d177c8f789f1b215a7759e
          Author: Stephan Ewen <sewen@apache.org>
          Date: 2017-07-19T08:24:52Z

          FLINK-7231 [distr. coordination] Fix slot release affecting SlotSharingGroup cleanup



            People

            • Assignee:
              StephanEwen Stephan Ewen
              Reporter:
              StephanEwen Stephan Ewen