Camel / CAMEL-8208

ZooKeeperRoutePolicy is not able to recover after session expiration

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: 2.13.2
    • Fix Version/s: 2.19.0
    • Component/s: camel-zookeeper
    • Labels:
      None
    • Estimated Complexity:
      Unknown

      Description

      My company uses ZooKeeperRoutePolicy to maintain a master/slave cluster. Sometimes the cluster has network problems that disconnect the app server from the remote ZooKeeper server. The disconnection usually doesn't last long, but it lasts long enough for the ZooKeeper session of ZooKeeperRoutePolicy to expire. From what we observe, ZooKeeperRoutePolicy does not recover and re-run the election after session expiration, which leads to a multiple-master situation.
      Would it be possible to enhance or fix this?

          Activity

          njiang Willem Jiang added a comment -

          If the re-election happens, the old leader should be switched to slave mode.
          I just checked the code of ZooKeeperRoutePolicy; by default it just shuts down the consumer once the node switches from master mode to slave mode.
          Can you double-check whether the old leader switched to slave mode?

          lwang Leo Wang added a comment -

          Hi Willem,

          The problem is that ZooKeeperRoutePolicy does not re-elect the master for the route whose session has expired, so that route keeps considering itself master. Meanwhile, ZooKeeper has already elected another route as master. As a result, we get two working master nodes.

          njiang Willem Jiang added a comment -

          I just checked the code of ZooKeeperRoutePolicy; it doesn't switch out of master mode if the node is disconnected.
          So I made a patch for it. Can you verify it in your system?

          diff --git a/components/camel-zookeeper/src/main/java/org/apache/camel/component/zookeeper/policy/ZooKeeperElection.java b/components/camel-zookeeper/src/main/java/org/apache/camel/component/zookeeper/policy/ZooKeeperElection.java
          index 3fb3eb1..180b738 100644
          --- a/components/camel-zookeeper/src/main/java/org/apache/camel/component/zookeeper/policy/ZooKeeperElection.java
          +++ b/components/camel-zookeeper/src/main/java/org/apache/camel/component/zookeeper/policy/ZooKeeperElection.java
          @@ -236,6 +236,10 @@ public class ZooKeeperElection {
                                   LOG.debug("This node is number '{}' on the candidate list, election is configured for the top '{}'. this node will be {}",
                                           new Object[]{location, enabledCount, masterNode.get() ? "enabled" : "disabled"}
                                   );
          +                    } else {
          +                        // Cannot find the location from the candidate, we need to reset the masterNode state
          +                        LOG.info("This node {} is session expirated, so it is switch to slave mode.", candidateName);
          +                        masterNode.set(false);
                               }
                               electionComplete.countDown();
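
The decision this patch touches can be illustrated with a small, self-contained sketch. It is not the camel-zookeeper implementation: `ElectionDecision`, `isMaster`, the 0-based position check, and the candidate names are all hypothetical. It only shows the idea that a node missing from the sorted candidate list (for example because its ephemeral node vanished with an expired session) must not be treated as master.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of camel-zookeeper: models the master decision only.
final class ElectionDecision {
    private ElectionDecision() { }

    // Returns true if candidateName is among the first enabledCount entries of the
    // sorted candidate list. A node that is not in the list at all is never master.
    static boolean isMaster(List<String> candidates, String candidateName, int enabledCount) {
        List<String> sorted = new ArrayList<>(candidates);
        Collections.sort(sorted); // zero-padded sequence suffixes sort in creation order
        int location = sorted.indexOf(candidateName); // -1 when the node is missing
        return location >= 0 && location < enabledCount;
    }
}

class ElectionDecisionDemo {
    public static void main(String[] args) {
        // Assumed candidate names; node-0000000001 lost its session and is gone.
        List<String> candidates = Arrays.asList("node-0000000003", "node-0000000002");
        System.out.println(ElectionDecision.isMaster(candidates, "node-0000000002", 1));
        System.out.println(ElectionDecision.isMaster(candidates, "node-0000000001", 1));
    }
}
```

The sketch assumes the decision is re-evaluated at all; it says nothing about what triggers that re-evaluation after the session is gone.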
          
          lwang Leo Wang added a comment -

          I don't think this patch works for us, because the internal ElectoralMonitorRoute is "dead" once the session has expired, which means it won't receive any Exchange message afterward. So your patched code is never reached.

          Actually, we've made a fix by watching the ZooKeeper client: if a SessionExpired event is received, we reset the ZooKeeperElection so that it creates a new ElectoralMonitorRoute and runs the election again the next time isMaster() is called.

          ZooKeeperElection.java
          
          ...
              private ElectoralMonitorRoute electoralMonitorRoute;
          ...
              private ZooKeeperEndpoint createCandidateNode(CamelContext camelContext) {
                  LOG.info("Initializing ZookeeperElection with uri '{}'", uri);
                  ZooKeeperEndpoint zep = camelContext.getEndpoint(uri, ZooKeeperEndpoint.class);
                  zep.getConfiguration().setCreate(true);
                  zep.getConfiguration().setTimeout(SolviansBaseRoute.SESSION_EXPIRED_TIMEOUT);
                  String fullpath = createFullPathToCandidate(zep);
                  Exchange e = zep.createExchange();
                  e.setPattern(ExchangePattern.InOut);
                  e.getIn().setHeader(ZooKeeperMessage.ZOOKEEPER_NODE, fullpath);
                  e.getIn().setHeader(ZooKeeperMessage.ZOOKEEPER_CREATE_MODE, CreateMode.EPHEMERAL_SEQUENTIAL);
                  producerTemplate.send(zep, e);
          
                  if (e.isFailed()) {
                      LOG.error("Error setting up election node " + fullpath, e.getException());
                  } else {
                      LOG.info("Candidate node '{}' has been created", fullpath);
                      try {
                          electoralMonitorRoute = new ElectoralMonitorRoute(zep);
                          camelContext.addRoutes(electoralMonitorRoute);
                      } catch (Exception ex) {
                          LOG.error("Error configuring ZookeeperElection", ex);
                      }
                  }
                  return zep;
          
              }
          ...
              public void reset() throws Exception {
                  camelContext.removeEndpoints(uri);
                  camelContext.removeComponent("zookeeper");
                  camelContext.stopRoute(this.electoralMonitorRoute.getRouteCollection().getId());
                  camelContext.removeRoute(this.electoralMonitorRoute.getRouteCollection().getId());
                  producerTemplate.stop();
                  this.isCandidateCreated = false;
                  this.electionComplete = new CountDownLatch(1);
                  producerTemplate = camelContext.createProducerTemplate();
              }
          ...
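
The snippet above shows reset() but not the part that reacts to the expired session. A minimal, library-free sketch of that control flow follows. Everything in it is an assumption made for illustration: `SessionState` stands in for the state the ZooKeeper client reports to a watcher (the real client uses `Watcher.Event.KeeperState.Expired`), and `SessionAwareElection` is a hypothetical wrapper, not the ZooKeeperElection class.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the session state reported by the ZooKeeper client.
enum SessionState { CONNECTED, DISCONNECTED, EXPIRED }

// Hypothetical wrapper: gives up mastership on expiry and re-runs the election lazily.
final class SessionAwareElection {
    private final AtomicBoolean master = new AtomicBoolean(false);
    private final AtomicBoolean electionNeeded = new AtomicBoolean(true);
    private final AtomicInteger elections = new AtomicInteger();
    private final BooleanSupplier election; // returns true if this node wins

    SessionAwareElection(BooleanSupplier election) {
        this.election = election;
    }

    // Called from the session watcher. Only expiry invalidates the election result;
    // a plain disconnect keeps the current state in this sketch.
    void onSessionEvent(SessionState state) {
        if (state == SessionState.EXPIRED) {
            master.set(false);
            electionNeeded.set(true);
        }
    }

    // Runs the election on first use and again after every expiry.
    boolean isMaster() {
        if (electionNeeded.compareAndSet(true, false)) {
            elections.incrementAndGet();
            master.set(election.getAsBoolean());
        }
        return master.get();
    }

    int electionCount() {
        return elections.get();
    }
}

class SessionAwareElectionDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Assumed outcome: this node wins the first election and loses the second.
        SessionAwareElection e = new SessionAwareElection(() -> calls.getAndIncrement() == 0);
        System.out.println(e.isMaster());
        e.onSessionEvent(SessionState.EXPIRED);
        System.out.println(e.isMaster());
    }
}
```

Keeping mastership on a plain disconnect is a design choice of this sketch; a more conservative setup could also pause the route while disconnected.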
          
          njiang Willem Jiang added a comment -

          Hi Leo,

          Thanks for sharing the solution with us. I think it's enough to stop and remove the election route; you don't need to remove the endpoint and create a new producerTemplate. The missing part is how to listen for the SessionExpired event and call the reset method.
          Please feel free to submit a patch or a pull request; I'd be happy to help you merge the patch into the Apache Camel git repo.

          Regards,

          Willem

          davsclaus Claus Ibsen added a comment -

          Any update on this?

          davsclaus Claus Ibsen added a comment -

          This is a bag of half-ugly code. We should try to use the Curator API, which has a nicer abstraction for leader election and watching.

          davsclaus Claus Ibsen added a comment -

          And we can let the fabric8 team donate some of the code we did for master/slave for camel
          https://github.com/jboss-fuse/fabric8/tree/1.2.0.redhat-6-3-x/fabric/fabric-camel/src/main/java/io/fabric8/camel

          davsclaus Claus Ibsen added a comment -

          There are a few other camel-zookeeper tickets about the route policy stuff. We would like to rewrite it to use the Curator API, which is an easier/better API for ZK.

          beny23 Gerald Benischke added a comment - - edited

          Would a suitable workaround for this be to use something like:

                  ZooKeeperRoutePolicy routePolicy = new ZooKeeperRoutePolicy(zookeeperUrl, 1);
                  routePolicy.setShouldStopConsumer(false);
          
                  from("quartz2:run-test?cron=0/2+*+*+*+*+?")
                      .routePolicy(routePolicy) 
                      .onException(IllegalStateException.class)
                          .log(LoggingLevel.INFO, "Not master")
                          .handled(true)
                      .end()
                      .log(LoggingLevel.INFO, "I am master");
          

          ?

          davsclaus Claus Ibsen added a comment -

          There is a new zookeeper-master component that works better.


            People

            • Assignee:
              Unassigned
              Reporter:
              lwang Leo Wang
            • Votes:
              1
              Watchers:
              6
