Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-6078

ZooKeeper based high availability services should not close the underlying CuratorFramework

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.0
    • Labels:
      None

      Description

      ZooKeeper based high availability tools like ZooKeeperLeaderRetrievalService and ZooKeeperLeaderElectionService expect that every instance of the services have a dedicated CuratorFramework instance assigned. Thus, they also close this CuratorFramework when the service is closed. This does not play well along with the newly introduced HighAvailabilityServices which caches a single CuratorFramework and shares it among all created services. In order to make it work properly together I propose to change the behaviour such that we no longer close the CuratorFramework clients in the ZooKeeper based services.

        Issue Links

          Activity

          Hide
          till.rohrmann Till Rohrmann added a comment -

          Fixed via ddd6a99a95b56c52ea5b5153b7270b578f5479bc

          Show
          till.rohrmann Till Rohrmann added a comment - Fixed via ddd6a99a95b56c52ea5b5153b7270b578f5479bc
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3781

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/3781
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3781

          Thanks for the review @StephanEwen. I will rebase the PR on the latest master and if Travis gives green light, then I'll merge it.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3781 Thanks for the review @StephanEwen. I will rebase the PR on the latest master and if Travis gives green light, then I'll merge it.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3781

          Good set of changes. I looked more deeply over the non-test code and the Kafka test code. Both look good.

          The remaining tests look like a pretty straightforward replacement of `RemoteEnvironment` by `TestEnvironment` and integration of `HaServices` into the ITCases.

          +1 from my side

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3781 Good set of changes. I looked more deeply over the non-test code and the Kafka test code. Both look good. The remaining tests look like a pretty straightforward replacement of `RemoteEnvironment` by `TestEnvironment` and integration of `HaServices` into the ITCases. +1 from my side
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/3781

          FLINK-6078 Remove CuratorFramework#close calls from ZooKeeper based HA services

          This PR is based on #3622.

          The main goal of this PR is to prevent the ZooKeeper based leader election and retrieval services from closing the underlying `CuratorFramework` instance when a election/retrieval service is closed. This will allow to share a single `CuratorFramework` instance among multiple election/retrieval services. This is a strict requirement for the Flip-6 work where all election/retrieval services are created by a `HighAvailabilityServices` implementation which shares the `CuratorFramework` among the created services. The respective changes can be found in the `ZooKeeperLeader[Election, Retrieval]Service` classes.

          In the existing code we now use as well an instance of `HighAvailabilityServices` in order to create the election/retrieval services and to manage the `CuratorFramework` instances. The respective changes are contained in `JobManager.scala:2036`, `TaskManager.scala:1643`, `MesosApplicationMasterRunner.java:299` and `YarnApplicationMasterRunner.java:343`.

          In order to create `Leader[Retrieval, Election]Services` for the `JobManager`, we need to provide a `JobID` to the `HighAvailabilityServices`. Since there is no such `JobID` defined a priori for a `JobManager`, we have introduced the `HighAvailabilityServices.DEFAULT_JOB_ID` which is to be used with the old distributed components.

          We also changed the `FlinkMiniCluster` to use the `EmbeddedHaServices` or the `ZooKeeperHaServices` in case of HA. The former service has HA like capabilities which allow to dynamically elect new leaders and notify retrievers about these changes. This allows to write better integration tests. The downside is that we can no longer connect via a `RemoteExecutionEnvironment` to a `FlinkMiniCluster`, because there is no way to obtain the current leader session id remotely. In order to execute Flink jobs on the `FlinkMiniCluster`, we have extended the `TestEnvironment` and the `TestStreamEnvironment` to be used in combination with the changed `FlinkMiniCluster`.

          Most of the remaining changes adapt test cases to use the `EmbeddedHaServices` or to work with the changed `FlinkMiniCluster` implementation.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink refactorZooKeeperServices

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3781.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3781



          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/3781 FLINK-6078 Remove CuratorFramework#close calls from ZooKeeper based HA services This PR is based on #3622. The main goal of this PR is to prevent the ZooKeeper based leader election and retrieval services from closing the underlying `CuratorFramework` instance when a election/retrieval service is closed. This will allow to share a single `CuratorFramework` instance among multiple election/retrieval services. This is a strict requirement for the Flip-6 work where all election/retrieval services are created by a `HighAvailabilityServices` implementation which shares the `CuratorFramework` among the created services. The respective changes can be found in the `ZooKeeperLeader [Election, Retrieval] Service` classes. In the existing code we now use as well an instance of `HighAvailabilityServices` in order to create the election/retrieval services and to manage the `CuratorFramework` instances. The respective changes are contained in `JobManager.scala:2036`, `TaskManager.scala:1643`, `MesosApplicationMasterRunner.java:299` and `YarnApplicationMasterRunner.java:343`. In order to create `Leader [Retrieval, Election] Services` for the `JobManager`, we need to provide a `JobID` to the `HighAvailabilityServices`. Since there is no such `JobID` defined a priori for a `JobManager`, we have introduced the `HighAvailabilityServices.DEFAULT_JOB_ID` which is to be used with the old distributed components. We also changed the `FlinkMiniCluster` to use the `EmbeddedHaServices` or the `ZooKeeperHaServices` in case of HA. The former service has HA like capabilities which allow to dynamically elect new leaders and notify retrievers about these changes. This allows to write better integration tests. The downside is that we can no longer connect via a `RemoteExecutionEnvironment` to a `FlinkMiniCluster`, because there is no way to obtain the current leader session id remotely. In order to execute Flink jobs on the `FlinkMiniCluster`, we have extended the `TestEnvironment` and the `TestStreamEnvironment` to be used in combination with the changed `FlinkMiniCluster`. Most of the remaining changes adapt test cases to use the `EmbeddedHaServices` or to work with the changed `FlinkMiniCluster` implementation. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink refactorZooKeeperServices Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3781.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3781

            People

            • Assignee:
              till.rohrmann Till Rohrmann
              Reporter:
              till.rohrmann Till Rohrmann
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development