Flink / FLINK-5040

Set correct input channel types with eager scheduling

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0, 1.1.4
    • Component/s: JobManager
    • Labels: None

      Description

      With eager deployment, the locations of all intermediate streams/partitions are already known when a consumer of such a stream/partition is scheduled. Nonetheless, we saw tasks with "unknown input channels" that were updated lazily at runtime. The cause was an incorrect producer execution state check, which required the producers to be in the RUNNING or DEPLOYING state when the consumer input channels were created.

      (We had a bogus fix for this in FLINK-3232. That "fix" did not actually fix anything; instead, it doubled the number of schedule or update consumers messages we sent.)
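
      The effect of the state check described above can be illustrated with a minimal sketch. It is not Flink's actual code: the class, enum, and method names are hypothetical, and the relaxed set of accepted states is an assumption, since this issue only states that the old check accepted RUNNING and DEPLOYING.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; none of these names are taken from the Flink code base.
class InputChannelTypeSketch {

    enum ExecutionState { CREATED, SCHEDULED, DEPLOYING, RUNNING, FINISHED, CANCELED, FAILED }

    enum ChannelType { LOCAL, REMOTE, UNKNOWN }

    // States accepted by the old check, as described in the issue.
    static final Set<ExecutionState> OLD_ACCEPTED =
            EnumSet.of(ExecutionState.DEPLOYING, ExecutionState.RUNNING);

    // Assumption: a relaxed check that also accepts producers which are
    // scheduled but not yet deploying, or which have already finished.
    static final Set<ExecutionState> RELAXED_ACCEPTED =
            EnumSet.of(ExecutionState.SCHEDULED, ExecutionState.DEPLOYING,
                    ExecutionState.RUNNING, ExecutionState.FINISHED);

    // Decides the channel type a consumer gets for one producer partition.
    static ChannelType channelType(
            Set<ExecutionState> accepted,
            ExecutionState producerState,
            boolean producerLocationKnown,
            boolean sameTaskManager) {
        if (producerLocationKnown && accepted.contains(producerState)) {
            return sameTaskManager ? ChannelType.LOCAL : ChannelType.REMOTE;
        }
        // The channel has to be updated lazily at runtime.
        return ChannelType.UNKNOWN;
    }

    public static void main(String[] args) {
        // With eager scheduling, a producer can have a known location
        // while still being in the SCHEDULED state.
        ExecutionState producer = ExecutionState.SCHEDULED;
        System.out.println("old check:     "
                + channelType(OLD_ACCEPTED, producer, true, false));
        System.out.println("relaxed check: "
                + channelType(RELAXED_ACCEPTED, producer, true, false));
    }
}
```

      In this sketch, a producer whose location is known but which is still SCHEDULED gets an UNKNOWN channel under the old check, which matches the symptom reported here.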

          Activity

          uce Ufuk Celebi added a comment -

          Fixed in 0d2e8b2, 2742d5c, 5d5637b (master) and b5a4cb6, 55c506f, 0bd8e02 (release-1.1).

          githubbot ASF GitHub Bot added a comment -

          Github user uce closed the pull request at:

          https://github.com/apache/flink/pull/2783

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/2784

          githubbot ASF GitHub Bot added a comment -

          GitHub user uce opened a pull request:

          https://github.com/apache/flink/pull/2784

          [backport] FLINK-5040 [jobmanager] Set correct input channel types with eager scheduling

          Backport of #2783 with no major differences.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/uce/flink eager_deployment-backport_1.1

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2784.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2784


          commit 2790511f6cf261537da6b3f909add29092c613d0
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-11-10T13:01:22Z

          Revert "FLINK-3232 [runtime] Add option to eagerly deploy channels"

          The reverted commit did not really fix anything, but hid the problem by
          brute force, sending many more schedule or update consumers messages.

          commit 10ffceea52b168a4819b41efb85a04a98ede0078
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-11-10T16:25:06Z

          Revert "FLINK-3232 [runtime] Add option to eagerly deploy channels"

          The reverted commit did not really fix anything, but hid the problem by
          brute force, sending many more schedule or update consumers messages.

          commit 097c4eae2931dc8c11d8fda28a36b715f5f376a6
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-11-09T17:25:06Z

          FLINK-5040 [jobmanager] Set correct input channel types with eager scheduling

          commit b549789956d4d1e594f2bfb642137e1f3f074b9c
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-11-10T10:15:47Z

          FLINK-5040 [taskmanager] Adjust partition request backoffs

          The back offs were hard coded before, which would have made it
          impossible to react to any potential problems with them.


          githubbot ASF GitHub Bot added a comment -

          GitHub user uce opened a pull request:

          https://github.com/apache/flink/pull/2783

          FLINK-5040 [jobmanager] Set correct input channel types with eager scheduling

          With eager deployment, the locations of all intermediate streams/partitions are already known when a consumer of such a stream/partition is scheduled. Nonetheless, we saw tasks with "unknown input channels" that were updated lazily at runtime. The cause was an incorrect producer execution state check, which required the producers to be in the RUNNING or DEPLOYING state when the consumer input channels were created. This is changed in the 2nd commit.

          The 1st commit reverts a bogus fix made as part of FLINK-3232. That "fix" did not actually fix anything; instead, it doubled the number of schedule or update consumers messages we sent.

          Furthermore, the 3rd commit changes the initial and maximum partition request backoff to 100 ms and 10 s, respectively. These values were hard-coded before. As a safety net for very slow deployments, they can now be changed via the configuration. In practice, no user should need to change them.
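
          To make the backoff values above concrete, here is a minimal sketch of a bounded backoff with a configurable initial and maximum value. The class and method names are hypothetical, and the doubling strategy is an assumption; the pull request text only gives the initial and maximum values.

```java
// Hypothetical sketch; the doubling strategy is an assumption.
final class PartitionRequestBackoff {

    private final int initialMs;
    private final int maxMs;

    // 0 means that no retry has been attempted yet.
    private int currentMs;

    PartitionRequestBackoff(int initialMs, int maxMs) {
        if (initialMs <= 0 || maxMs < initialMs) {
            throw new IllegalArgumentException("Require 0 < initial <= max");
        }
        this.initialMs = initialMs;
        this.maxMs = maxMs;
    }

    // Moves to the next backoff value. Returns false once the maximum
    // has already been used, meaning the caller should give up.
    boolean increase() {
        if (currentMs == 0) {
            currentMs = initialMs;
            return true;
        }
        if (currentMs < maxMs) {
            // Use long arithmetic to avoid int overflow when doubling.
            currentMs = (int) Math.min(2L * currentMs, maxMs);
            return true;
        }
        return false;
    }

    int currentMs() {
        return currentMs;
    }

    public static void main(String[] args) {
        // Values from the pull request: 100 ms initial, 10 s maximum.
        PartitionRequestBackoff backoff = new PartitionRequestBackoff(100, 10_000);
        while (backoff.increase()) {
            System.out.println("retry after " + backoff.currentMs() + " ms");
        }
    }
}
```

          Under the doubling assumption, the delays are 100, 200, 400, 800, 1600, 3200, 6400, and 10000 ms, after which the sketch stops retrying.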

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/uce/flink eager_deployment

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2783.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2783


          commit bbbe8e9c19eb528e3e5d8e046e79298a300af556
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-11-09T15:07:22Z

          Revert "FLINK-3232 [runtime] Add option to eagerly deploy channels"

          The reverted commit did not really fix anything, but hid the problem by
          brute force, sending many more schedule or update consumers messages.

          commit 70088f2acade2f20b8b75e18955f91793f7614c3
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-11-09T17:25:06Z

          FLINK-5040 [jobmanager] Set correct input channel types with eager scheduling

          commit 9d186d9e42007f1144e64c802466befb858b7363
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-11-10T10:15:47Z

          FLINK-5040 [taskmanager] Adjust partition request backoffs

          The back offs were hard coded before, which would have made it
          impossible to react to any potential problems with them.



            People

            • Assignee: Ufuk Celebi (uce)
            • Reporter: Ufuk Celebi (uce)
            • Votes: 0
            • Watchers: 2
