Description
We at Spotify have just been hit by a bug related to ZkClient. It brought a whole Kafka cluster down.
Long story short, we restarted a TOR switch, and every broker in the cluster lost connectivity to the ZooKeeper cluster, which lived in another rack. As a result, all of the brokers' ZooKeeper sessions expired.
Everything would have been fine if we had not also, for some reason, lost the hostname-to-IP-address mapping. Since we did lose that mapping, we got an UnknownHostException when the broker tried to reconnect. The exception was swallowed and the broker never reconnected to ZooKeeper, which in turn made our cluster of brokers useless.
If we had a "handleSessionEstablishmentError" callback, the exception could be caught there, and we could simply kill the KafkaServer process and start it cleanly, since this kind of exception is fatal for the Kafka cluster.
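A minimal sketch of the idea, not the actual ZkClient or Kafka code: the interfaces `SessionStateListener` and `Connector`, the class `FatalOnSessionError`, and the `reconnect` helper are hypothetical stand-ins introduced only for illustration. It assumes the client reports a failed reconnect to a listener instead of swallowing it, and that the listener reacts by triggering a shutdown exactly once.

```java
import java.net.UnknownHostException;
import java.util.concurrent.atomic.AtomicBoolean;

class SessionErrorSketch {

    /** Hypothetical stand-in for a ZooKeeper client state listener. */
    interface SessionStateListener {
        void handleNewSession() throws Exception;
        void handleSessionEstablishmentError(Throwable error) throws Exception;
    }

    /** Hypothetical stand-in for the step that opens a new ZooKeeper connection. */
    interface Connector {
        void connect() throws Exception;
    }

    /** Treats a failure to re-establish the session as fatal, at most once. */
    static final class FatalOnSessionError implements SessionStateListener {
        private final Runnable shutdown;
        private final AtomicBoolean triggered = new AtomicBoolean(false);

        FatalOnSessionError(Runnable shutdown) {
            this.shutdown = shutdown;
        }

        @Override
        public void handleNewSession() {
            // Nothing to do in this sketch; a broker would re-register itself here.
        }

        @Override
        public void handleSessionEstablishmentError(Throwable error) {
            // Guard so that repeated errors do not trigger the shutdown twice.
            if (triggered.compareAndSet(false, true)) {
                System.err.println("Fatal: could not re-establish session: " + error);
                shutdown.run();
            }
        }
    }

    /** Reconnects and reports a failure to the listener instead of swallowing it. */
    static void reconnect(Connector connector, SessionStateListener listener) throws Exception {
        try {
            connector.connect();
        } catch (Exception e) {
            listener.handleSessionEstablishmentError(e);
            return;
        }
        listener.handleNewSession();
    }

    public static void main(String[] args) throws Exception {
        // The shutdown action is simulated; a real broker would stop the server process.
        SessionStateListener listener =
            new FatalOnSessionError(() -> System.err.println("shutting down broker (simulated)"));
        reconnect(() -> { throw new UnknownHostException("zookeeper1.example.invalid"); }, listener);
    }
}
```

In a real broker the `Runnable` would presumably stop the server so that a supervisor can restart it once name resolution works again; that policy is an assumption of this sketch.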
2015-05-05T12:49:01.709+00:00 127.0.0.1 apache-kafka[main-EventThread] INFO zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@7303d690
2015-05-05T12:49:01.711+00:00 127.0.0.1 apache-kafka[main-EventThread] ERROR zookeeper.ClientCnxn - Error while calling watcher
2015-05-05T12:49:01.711+00:00 127.0.0.1 java.lang.RuntimeException: Exception while restarting zk client
2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:462)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:368)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2015-05-05T12:49:01.711+00:00 127.0.0.1 Caused by: org.I0Itec.zkclient.exception.ZkException: Unable to connect to zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local
2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:66)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:939)
2015-05-05T12:49:01.711+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458)
2015-05-05T12:49:01.711+00:00 127.0.0.1 ... 3 more
2015-05-05T12:49:01.712+00:00 127.0.0.1 Caused by: java.net.UnknownHostException: zookeeper1.spotify.net: Name or service not known
2015-05-05T12:49:01.712+00:00 127.0.0.1 at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at java.net.InetAddress.getAllByName0(InetAddress.java:1246)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at java.net.InetAddress.getAllByName(InetAddress.java:1162)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at java.net.InetAddress.getAllByName(InetAddress.java:1098)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
2015-05-05T12:49:01.712+00:00 127.0.0.1 at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:64)
2015-05-05T12:49:01.713+00:00 127.0.0.1 ... 5 more
2015-05-05T12:49:01.713+00:00 127.0.0.1 apache-kafka[ZkClient-EventThread-18-zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local] ERROR zkclient.ZkEventThread - Error handling event ZkEvent[Children of /config/changes changed sent to kafka.server.TopicConfigManager$ConfigChangeListener$@17638f6]
2015-05-05T12:49:01.713+00:00 127.0.0.1 java.lang.NullPointerException
2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:439)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:436)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:436)
2015-05-05T12:49:01.713+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:445)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:566)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
2015-05-05T12:49:01.714+00:00 127.0.0.1 apache-kafka[main-EventThread] INFO zookeeper.ClientCnxn - EventThread shut down
2015-05-05T12:49:01.714+00:00 127.0.0.1 apache-kafka[ZkClient-EventThread-18-zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local] ERROR zkclient.ZkEventThread - Error handling event ZkEvent[Data of /controller changed sent to kafka.server.ZookeeperLeaderElector$LeaderChangeListener@18360394]
2015-05-05T12:49:01.714+00:00 127.0.0.1 java.lang.NullPointerException
2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:439)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:436)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:436)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:544)
2015-05-05T12:49:01.714+00:00 127.0.0.1 at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
Issue Links
- relates to
  - KAFKA-2169 Upgrade to zkclient-0.5 (Resolved)
  - KAFKA-873 Consider replacing zkclient with curator (with zkclient-bridge) (Resolved)