Kafka / KAFKA-2182

zkClient dies if there is any exception while reconnecting

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Implemented
    • Affects Version/s: 0.8.1
    • Fix Version/s: 0.9.0.0
    • Component/s: core
    • Labels: None

    Description

      We at Spotify have just been hit by a bug related to ZkClient that took an entire Kafka cluster down.

      Long story short, we restarted a top-of-rack (TOR) switch, and every broker in the cluster lost connectivity to the ZooKeeper cluster, which lived in another rack. As a result, all of the brokers' ZooKeeper sessions expired.

      Everything would have been fine had we not also lost the hostname-to-IP-address mapping for some reason. Because we did lose that mapping, we got an UnknownHostException when the brokers tried to reconnect. The exception was swallowed, and the brokers never reconnected to ZooKeeper, which in turn made our cluster of brokers useless.

      If we had a "handleSessionEstablishmentError" callback, the exception could be caught and we could simply kill the KafkaServer process and start it cleanly, since this kind of exception is fatal for the Kafka cluster.
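
      A minimal sketch of how such a callback could be used, assuming a listener interface that exposes a handleSessionEstablishmentError(Throwable) method. SessionListener, FailFastListener, Connector, and the injected shutdown hook are hypothetical names for illustration, not ZkClient or Kafka APIs; a real broker would presumably trigger its own shutdown path instead of the Runnable used here.

```java
import java.net.UnknownHostException;
import java.util.concurrent.atomic.AtomicReference;

public class SessionErrorSketch {

    /** Hypothetical listener interface; it does not mirror the actual ZkClient API. */
    interface SessionListener {
        void handleNewSession();
        void handleSessionEstablishmentError(Throwable error);
    }

    /** Treats a failed session re-establishment as fatal and runs the shutdown hook. */
    static final class FailFastListener implements SessionListener {
        private final Runnable shutdown;
        final AtomicReference<Throwable> lastError = new AtomicReference<>();

        FailFastListener(Runnable shutdown) {
            this.shutdown = shutdown;
        }

        @Override
        public void handleNewSession() {
            // A real listener would re-register ephemeral nodes here.
        }

        @Override
        public void handleSessionEstablishmentError(Throwable error) {
            lastError.set(error); // keep the cause for logging
            shutdown.run();       // fail fast instead of staying half-alive
        }
    }

    /** Hypothetical reconnect step that may fail, e.g. on DNS resolution. */
    interface Connector {
        void connect() throws Exception;
    }

    /** Reports a connect failure to the listener instead of swallowing it. */
    static void reconnect(Connector connector, SessionListener listener) {
        try {
            connector.connect();
        } catch (Exception e) {
            listener.handleSessionEstablishmentError(e);
            return;
        }
        listener.handleNewSession();
    }

    public static void main(String[] args) {
        FailFastListener listener = new FailFastListener(
                () -> System.out.println("fatal ZooKeeper error, shutting down broker"));
        reconnect(() -> {
            throw new UnknownHostException("zookeeper1.example.invalid");
        }, listener);
        System.out.println("cause: " + listener.lastError.get());
    }
}
```

      The point of the sketch is only that the failure reaches code that can act on it; whether to exit the process or retry with backoff is a separate policy decision.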

      2015-05-05T12:49:01.709+00:00 127.0.0.1 apache-kafka[main-EventThread] INFO  zookeeper.ZooKeeper  - Initiating client connection, connectString=zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@7303d690
      2015-05-05T12:49:01.711+00:00 127.0.0.1 apache-kafka[main-EventThread] ERROR zookeeper.ClientCnxn  - Error while calling watcher
      2015-05-05T12:49:01.711+00:00 127.0.0.1 java.lang.RuntimeException: Exception while restarting zk client
      2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:462)
      2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:368)
      2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
      2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
      2015-05-05T12:49:01.711+00:00 127.0.0.1 Caused by: org.I0Itec.zkclient.exception.ZkException: Unable to connect to zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local
      2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:66)
      2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:939)
      2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458)
      2015-05-05T12:49:01.711+00:00 127.0.0.1 ... 3 more
      2015-05-05T12:49:01.712+00:00 127.0.0.1 Caused by: java.net.UnknownHostException: zookeeper1.spotify.net: Name or service not known
      2015-05-05T12:49:01.712+00:00 127.0.0.1 at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
      2015-05-05T12:49:01.712+00:00 127.0.0.1 at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
      2015-05-05T12:49:01.712+00:00 127.0.0.1 at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
      2015-05-05T12:49:01.712+00:00 127.0.0.1 at java.net.InetAddress.getAllByName0(InetAddress.java:1246)
      2015-05-05T12:49:01.712+00:00 127.0.0.1 at java.net.InetAddress.getAllByName(InetAddress.java:1162)
      2015-05-05T12:49:01.712+00:00 127.0.0.1 at java.net.InetAddress.getAllByName(InetAddress.java:1098)
      2015-05-05T12:49:01.712+00:00 127.0.0.1 at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
      2015-05-05T12:49:01.712+00:00 127.0.0.1 at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
      2015-05-05T12:49:01.712+00:00 127.0.0.1 at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
      2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:64)
      2015-05-05T12:49:01.713+00:00 127.0.0.1 ... 5 more
      2015-05-05T12:49:01.713+00:00 127.0.0.1 apache-kafka[ZkClient-EventThread-18-zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local] ERROR zkclient.ZkEventThread  - Error handling event ZkEvent[Children of /config/changes changed sent to kafka.server.TopicConfigManager$ConfigChangeListener$@17638f6]
      2015-05-05T12:49:01.713+00:00 127.0.0.1 java.lang.NullPointerException
      2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
      2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:439)
      2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:436)
      2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
      2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:436)
      2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:445)
      2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:566)
      2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
      2015-05-05T12:49:01.714+00:00 127.0.0.1 apache-kafka[main-EventThread] INFO  zookeeper.ClientCnxn  - EventThread shut down
      2015-05-05T12:49:01.714+00:00 127.0.0.1 apache-kafka[ZkClient-EventThread-18-zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local] ERROR zkclient.ZkEventThread  - Error handling event ZkEvent[Data of /controller changed sent to kafka.server.ZookeeperLeaderElector$LeaderChangeListener@18360394]
      2015-05-05T12:49:01.714+00:00 127.0.0.1 java.lang.NullPointerException
      2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
      2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:439)
      2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:436)
      2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
      2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:436)
      2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:544)
      2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
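
      The trace above is consistent with a reconnect that closes the old connection before the replacement is created: once creating the replacement fails with UnknownHostException, later calls dereference a missing handle and fail with NullPointerException. The toy class below reproduces that pattern under this assumption. ToyConnection, Handle, and HandleFactory are hypothetical and do not mirror the actual ZkClient source.

```java
import java.net.UnknownHostException;

public class ReconnectFailureSketch {

    /** Hypothetical stand-in for a live ZooKeeper handle. */
    static final class Handle {
        boolean exists(String path) {
            return true;
        }
    }

    /** Hypothetical factory; creating a handle may fail on host name resolution. */
    interface HandleFactory {
        Handle create() throws UnknownHostException;
    }

    static final class ToyConnection {
        private Handle handle;

        void connect(HandleFactory factory) throws UnknownHostException {
            handle = factory.create();
        }

        void close() {
            handle = null;
        }

        /** Closes first, then connects: a failed connect leaves handle == null. */
        void reconnect(HandleFactory factory) throws UnknownHostException {
            close();
            connect(factory);
        }

        boolean exists(String path) {
            // Throws NullPointerException if an earlier reconnect failed.
            return handle.exists(path);
        }
    }

    public static void main(String[] args) throws Exception {
        ToyConnection conn = new ToyConnection();
        conn.connect(Handle::new);
        try {
            conn.reconnect(() -> {
                throw new UnknownHostException("zookeeper1.example.invalid");
            });
        } catch (UnknownHostException e) {
            System.out.println("reconnect failed: " + e.getMessage());
        }
        try {
            conn.exists("/controller");
        } catch (NullPointerException e) {
            System.out.println("later call failed with NullPointerException");
        }
    }
}
```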
      

People

    • Assignee: Parth Brahmbhatt (parth.brahmbhatt)
    • Reporter: Igor Maravić (i_maravic)
    • Votes: 0
    • Watchers: 3