  1. ZooKeeper
  2. ZOOKEEPER-3814

ZooKeeper config propagates even with disabled dynamic reconfig


Details

    Description

      Hello,

      We recently upgraded our 5-node ZooKeeper ensemble from v3.4.8 to v3.5.6. The upgrade itself went through without issues.

      This is what the ZooKeeper config looks like:

      tickTime=2000
      dataDir=/zookeeper-data/
      initLimit=5
      syncLimit=2
      maxClientCnxns=2048
      autopurge.snapRetainCount=3
      autopurge.purgeInterval=1
      4lw.commands.whitelist=stat, ruok, conf, isro, mntr
      authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
      requireClientAuthScheme=sasl
      quorum.cnxn.threads.size=20
      quorum.auth.enableSasl=true
      quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
      quorum.auth.learnerRequireSasl=true
      quorum.auth.learner.saslLoginContext=QuorumLearner
      quorum.auth.serverRequireSasl=true
      quorum.auth.server.saslLoginContext=QuorumServer
      server.17=node1.foo.bar.com:2888:3888;2181
      server.19=node2.foo.bar.com:2888:3888;2181
      server.20=node3.foo.bar.com:2888:3888;2181
      server.21=node4.foo.bar.com:2888:3888;2181
      server.22=node5.bar.com:2888:3888;2181

      After the upgrade, we had to migrate server.22 to the foo.bar.com domain name (on the same node) because of Kerberos referral issues. As part of the migration we also gave it a different server identifier, 23. The new server list looked like this:

      server.17=node1.foo.bar.com:2888:3888;2181
      server.19=node2.foo.bar.com:2888:3888;2181
      server.20=node3.foo.bar.com:2888:3888;2181
      server.21=node4.foo.bar.com:2888:3888;2181
      server.23=node5.foo.bar.com:2888:3888;2181
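
To make the change between the two server lists explicit, here is a minimal sketch that diffs the `server.N` entries of the old and new configs. The helper names `parse_servers` and `diff_servers` are hypothetical and only for illustration; the config lines are copied from this report.

```python
def parse_servers(config_text):
    """Map server ID -> address spec for every 'server.N=...' line."""
    servers = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line.startswith("server.") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        servers[int(key[len("server."):])] = value.strip()
    return servers


def diff_servers(old_text, new_text):
    """Return (removed_ids, added_ids) between two configs."""
    old, new = parse_servers(old_text), parse_servers(new_text)
    return sorted(set(old) - set(new)), sorted(set(new) - set(old))


OLD = """
server.17=node1.foo.bar.com:2888:3888;2181
server.19=node2.foo.bar.com:2888:3888;2181
server.20=node3.foo.bar.com:2888:3888;2181
server.21=node4.foo.bar.com:2888:3888;2181
server.22=node5.bar.com:2888:3888;2181
"""

NEW = """
server.17=node1.foo.bar.com:2888:3888;2181
server.19=node2.foo.bar.com:2888:3888;2181
server.20=node3.foo.bar.com:2888:3888;2181
server.21=node4.foo.bar.com:2888:3888;2181
server.23=node5.foo.bar.com:2888:3888;2181
"""

removed, added = diff_servers(OLD, NEW)
print("removed IDs:", removed)  # removed IDs: [22]
print("added IDs:", added)      # added IDs: [23]
```

Any ID in the "removed" list is one that no server should try to contact after the restart.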

      We restarted all the nodes in the ensemble with the updated config above. The migrated node joined the quorum successfully and served all clients directly connected to it without any issues.

      Recently, when a leader election happened, server.23=node5.foo.bar.com (the migrated node) was chosen as Leader (as it has the highest ID). After that, ZooKeeper was unable to serve any clients, and all the servers were somehow still trying to establish a channel to 22 (old DNS name: node5.bar.com), logging the error below in a loop:

      2020-05-02 01:43:03,026 [myid:23] - WARN [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve address: node5.bar.com
      java.net.UnknownHostException: node5.bar.com: Name or service not known
          at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
          at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)
          at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)
          at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)
          at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)
          at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)
          at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)
          at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)
          at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)
          at org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)
          at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)
          at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)
          at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)
          at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)
          at java.base/java.lang.Thread.run(Thread.java:834)
      2020-05-02 01:43:03,026 [myid:23] - WARN [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at election address node5.bar.com:3888
      java.net.UnknownHostException: node5.bar.com
          at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)
          at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)
          at java.base/java.net.Socket.connect(Socket.java:591)
          at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:650)
          at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:714)
          at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)
          at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)
          at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)
          at java.base/java.lang.Thread.run(Thread.java:834)
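
For anyone trying to spot the same symptom in their own logs, the following is a rough sketch that scans log text for "Cannot open channel to N at election address ..." warnings and reports any target ID that is not in the configured server list. The regular expression is based only on the message format shown in the log above, and the function name `unexpected_election_targets` is hypothetical.

```python
import re

# Pattern derived from the WARN line quoted in this report; other ZooKeeper
# versions may word the message differently.
CHANNEL_RE = re.compile(r"Cannot open channel to (\d+) at election address (\S+)")


def unexpected_election_targets(log_text, configured_ids):
    """Return {server_id: election_address} for IDs not in configured_ids."""
    hits = {}
    for line in log_text.splitlines():
        match = CHANNEL_RE.search(line)
        if match and int(match.group(1)) not in configured_ids:
            hits[int(match.group(1))] = match.group(2)
    return hits


SAMPLE_LOG = (
    "2020-05-02 01:43:03,026 [myid:23] - WARN "
    "[WorkerSender[myid=23]:QuorumCnxManager@679] - "
    "Cannot open channel to 22 at election address node5.bar.com:3888"
)

print(unexpected_election_targets(SAMPLE_LOG, {17, 19, 20, 21, 23}))
# {22: 'node5.bar.com:3888'}
```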

      Fetching the config from the live ZooKeeper znode does not show "22" as a member of the ensemble either. It's not clear how "22" is still coming into the picture.

      In [4]: zk.get('/zookeeper/config')
      Out[4]:
      ('server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
      server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
      server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
      server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
      server.23=node5.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
      version=0',
      ZnodeStat(czxid=0, mzxid=0, ctime=0, mtime=1588399290245, version=-1, cversion=0, aversion=-1, ephemeralOwner=0, dataLength=360, numChildren=0, pzxid=0))
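
The same check can be done programmatically rather than by eye. This is a small sketch that extracts the server IDs from a `/zookeeper/config` payload; it works on the raw string or bytes (the first element of the tuple returned by `zk.get` above), so it does not need a live connection. The function name `config_server_ids` is hypothetical, and the payload is copied from the output above.

```python
def config_server_ids(payload):
    """Return the set of server IDs listed in a /zookeeper/config payload."""
    if isinstance(payload, bytes):
        payload = payload.decode("utf-8")
    ids = set()
    for line in payload.splitlines():
        if line.startswith("server."):
            # 'server.17=...' -> 'server.17' -> '17'
            ids.add(int(line.split("=", 1)[0].split(".", 1)[1]))
    return ids


PAYLOAD = (
    b"server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n"
    b"server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n"
    b"server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n"
    b"server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n"
    b"server.23=node5.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n"
    b"version=0"
)

ids = config_server_ids(PAYLOAD)
print(sorted(ids))  # [17, 19, 20, 21, 23]
print(22 in ids)    # False
```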

      We suspected some kind of caching issue and restarted ZooKeeper on all the nodes, but that didn't help. Whenever node5 becomes the Leader, ID 22 pops up again. We even rebooted node5, and that didn't help either.

      We also looked at '/zookeeper/config' content from snapshot files and did not find any reference to ID:22.

      Any help would be greatly appreciated.

      NOTE: dynamic config is disabled.

      Thanks,
      Rajkiran

            People

              symat Mate Szalay-Beko
              rajsura Rajkiran Sura