We are also facing the same problem in our production cluster in one of our data centers:
3-node ZooKeeper ensemble running 3.3.6
6-node Kafka cluster on 0.8.0
~10 mirror makers for cross-DC replication across ~900 partitions (each mirror maker creates that many znodes for offset storage)
~10 consumers consuming data across those ~900 partitions (same number of offset znodes again)
Both the data and data-log directories are on dedicated devices.
1. It happens mostly at the beginning of an hour, i.e. at the 2nd or 3rd second.
2. Some client (mirror maker/broker/consumer) connections get dropped, with the client reporting that the server has been unresponsive for 4000 ms:
[06/04/2016:12:00:03 PDT] [INFO] [org.apache.zookeeper.ClientCnxn main-SendThread(<zookeeperHost>:2181)]: Client session timed out, have not heard from server in 4000ms for sessionid 0x153c64c806416a8, closing socket connection and attempting reconnect
3. This leads to EndOfStreamException and CancelledKeyException entries in the ZooKeeper logs.
Based on the discussion in the thread above, we looked at the following:
1. We counted the number of established connections to each ZooKeeper server every second for a few hours, to see whether a spike in new connections at the start of an hour could fill the ZooKeeper request queue and keep it from responding to pings.
There was no spike at the hour boundaries.
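For the record, this per-second sampling can be done without extra tooling by parsing /proc/net/tcp on each ZooKeeper host; a minimal sketch of one way to do it (IPv4 only; 2181 decimal is 0885 hex, and socket state 01 is ESTABLISHED):

```shell
# Count ESTABLISHED IPv4 connections whose local port is 2181 by parsing
# /proc/net/tcp: field 2 is local_address as hex ip:port, field 4 is the
# socket state (01 = ESTABLISHED). 2181 decimal = 0885 hex.
established=$(awk '$4 == "01" { split($2, a, ":"); if (a[2] == "0885") n++ }
                   END { print n + 0 }' /proc/net/tcp)
echo "$(date '+%H:%M:%S') established_to_2181=$established"
```

Run once per second (e.g. under `watch -n 1` or a sleep loop) on each server and correlate the counts with the hour boundary; /proc/net/tcp6 would need the same treatment for IPv6 clients.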
2. We looked at disk usage on all the ZooKeeper servers to verify whether an I/O spike at the hour boundaries could be delaying fsync and making the server unresponsive.
No such spike was observed; the data in/out rate remained normal throughout the hour.
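Throughput graphs can still miss short latency stalls, so a crude probe that times a single synchronous write on the transaction-log device may help; DATADIR below is a placeholder for the actual data-log directory:

```shell
# Time one small synchronous (O_DSYNC) write on the target device; on a
# healthy disk this completes in a few milliseconds. DATADIR is a placeholder.
DATADIR=${DATADIR:-/tmp}
start=$(date +%s%N)   # nanoseconds since epoch (GNU date)
dd if=/dev/zero of="$DATADIR/fsync_probe" bs=512 count=1 oflag=dsync 2>/dev/null
end=$(date +%s%N)
rm -f "$DATADIR/fsync_probe"
elapsed_ms=$(( (end - start) / 1000000 ))
echo "dsync write took ${elapsed_ms} ms"
```

Running this in a loop across an hour boundary would show whether sync latency spikes at exactly the moment the sessions drop.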
3. We examined the ZooKeeper logs for anything unusual at the end of the previous hour or the start of the current hour that could explain the unresponsiveness.
No luck there either.
4. We use a RollingFileAppender for the ZooKeeper logs that rolls hourly, and the archive directory shares a disk with the dataDir. To rule this out, we changed the rolling frequency to once per day and still observed EndOfStreamExceptions at the start of every hour.
5. We looked at the GC logs (-verbose:gc); GC pause times are nearly constant (~0.13 seconds) throughout the day.
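Since ~0.13 s pauses are nowhere near a 4000 ms gap, one JVM-side suspect that -verbose:gc alone does not show is a non-GC safepoint pause; a sketch of additional HotSpot flags (Java 6/7 era, matching this stack; the log path is a placeholder) that surface total stopped time:

```
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/path/to/zookeeper-gc.log
```

With these, every stop-the-world pause (GC or not) shows up as a "Total time for which application threads were stopped" line in the GC log.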
6. We observe this behavior in only one of our data centers, even though more data flows in and out of the unaffected data center.
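One more hourly-aligned suspect worth ruling out on the ZooKeeper hosts is a system cron job (logrotate, backups, monitoring agents); a rough sketch, assuming typical Linux cron locations:

```shell
# List hourly system cron jobs and active system crontab entries; the
# paths are common Linux defaults and may differ per distribution.
hourly_jobs=$([ -d /etc/cron.hourly ] && ls /etc/cron.hourly || true)
echo "hourly jobs: ${hourly_jobs:-none}"
# Non-comment lines of the system crontab, if it exists.
grep -hv '^#' /etc/crontab 2>/dev/null || true
```

Per-user crontabs (`crontab -l` as each service user) would need the same check.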
Why this is a concern for us:
1. Two weeks ago we had four Kafka controller switches in a single day for the associated Kafka cluster, all at the beginning of an hour. This pushed the cluster into an unhealthy state in which a significant number of partitions went offline. Controller state is stored in ZooKeeper, and we suspect this unresponsiveness caused the switches.
2. Since it happens consistently at the beginning of the hour, we want to narrow down the cause and fix it before we deploy even more clients. If the suggestion is to expand the ensemble, we would be happy to do that, but we need some figures to justify the expansion to devOps.
Kindly advise on what else we should look at.
Please let me know if you need any other information to understand the problem better.