Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-3577

Partial cluster breakdown

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 0.9.0.1
    • None
    • None
    • None
    • Debian GNU/Linux 7.9 (wheezy)

    Description

      We run a cluster of 3 brokers and 3 zookeepers, but after we upgraded to 0.9.0.1 our cluster sometimes goes partially down, and we can't figure why. A full cluster restart fixed the problem.
      I've added a snippet of the logs on each broker below.

      Broker 4:

      [2016-04-18 05:58:26,390] INFO [Group Metadata Manager on Broker 4]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
      [2016-04-18 06:05:55,218] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
      [2016-04-18 06:05:57,396] ERROR Session has expired while creating /controller (kafka.utils.ZKCheckedEphemeral)
      [2016-04-18 06:05:57,396] INFO Result of znode creation is: SESSIONEXPIRED (kafka.utils.ZKCheckedEphemeral)
      [2016-04-18 06:05:57,400] ERROR Error while electing or becoming leader on broker 4 (kafka.server.ZookeeperLeaderElector)
      org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
      at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
      at kafka.utils.ZKCheckedEphemeral.create(ZkUtils.scala:1090)
      at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:81)
      at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
      at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
      at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
      at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
      at kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
      at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:823)
      at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
      Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
      ... 9 more
      [2016-04-18 06:05:57,420] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
      [2016-04-18 06:05:57,424] INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
      [2016-04-18 06:05:57,425] INFO 4 successfully elected as leader (kafka.server.ZookeeperLeaderElector)
      [2016-04-18 06:05:57,885] INFO [ReplicaFetcherManager on broker 4] Removed fetcher for partitions [__consumer_offsets,32],[__consumer_offsets,44],[cdrrecords-errors,1],[cdrrecords,0],[__consumer_offsets,38],[__consumer_offsets,8],[events
      ,2],[__consumer_offsets,20],[__consumer_offsets,2],[__consumer_offsets,14],[__consumer_offsets,26] (kafka.server.ReplicaFetcherManager)
      [2016-04-18 06:05:57,892] INFO [ReplicaFetcherManager on broker 4] Removed fetcher for partitions [__consumer_offsets,35],[__consumer_offsets,23],[__consumer_offsets,47],[__consumer_offsets,11],[__consumer_offsets,5],[events-errors,2],[_
      _consumer_offsets,17],[__consumer_offsets,41],[__consumer_offsets,29] (kafka.server.ReplicaFetcherManager)
      [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-17 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-23 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-29 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-35 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-41 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:57,895] INFO Truncating log events-errors-2 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-5 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-11 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:57,896] INFO Truncating log __consumer_offsets-47 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:57,904] INFO [ReplicaFetcherManager on broker 4] Added fetcher for partitions List([[__consumer_offsets,17], initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,23], initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,29], initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,35], initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,41], initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , [[events-errors,2], initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,5], initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,11], initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,47], initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] ) (kafka.server.ReplicaFetcherManager)
      [2016-04-18 06:05:57,908] INFO [ReplicaFetcherThread-0-6], Shutting down (kafka.server.ReplicaFetcherThread)
      [2016-04-18 06:05:58,783] INFO re-registering broker info in ZK for broker 4 (kafka.server.KafkaHealthcheck)
      [2016-04-18 06:05:58,784] INFO Creating /brokers/ids/4 (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
      [2016-04-18 06:05:58,799] INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
      [2016-04-18 06:05:58,799] INFO Registered broker 4 at path /brokers/ids/4 with addresses: PLAINTEXT -> EndPoint(kafka303.prod.local,9092,PLAINTEXT) (kafka.utils.ZkUtils)
      [2016-04-18 06:05:58,799] INFO done re-registering broker (kafka.server.KafkaHealthcheck)
      [2016-04-18 06:05:58,799] INFO Subscribing to /brokers/topics path to watch for new topics (kafka.server.KafkaHealthcheck)
      [2016-04-18 06:05:58,804] INFO New leader is 4 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
      [2016-04-18 06:05:58,807] INFO New leader is 4 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
      [2016-04-18 06:06:03,895] INFO [ReplicaFetcherThread-0-6], Stopped (kafka.server.ReplicaFetcherThread)
      [2016-04-18 06:06:03,895] INFO [ReplicaFetcherThread-0-6], Shutdown completed (kafka.server.ReplicaFetcherThread)

      Broker 5:

      [2016-04-18 05:57:05,926] INFO [Group Metadata Manager on Broker 5]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
      [2016-04-18 06:05:56,304] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
      [2016-04-18 06:05:56,309] INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
      [2016-04-18 06:05:56,312] INFO 5 successfully elected as leader (kafka.server.ZookeeperLeaderElector)
      [2016-04-18 06:05:58,545] INFO [ReplicaFetcherManager on broker 5] Removed fetcher for partitions [__consumer_offsets,17],[__consumer_offsets,23],[__consumer_offsets,29],[__consumer_offsets,35],[__consumer_offsets,41],[events-errors,2],[
      __consumer_offsets,5],[__consumer_offsets,11],[__consumer_offsets,47] (kafka.server.ReplicaFetcherManager)
      [2016-04-18 06:05:58,554] INFO [ReplicaFetcherManager on broker 5] Removed fetcher for partitions [__consumer_offsets,2],[__consumer_offsets,38],[__consumer_offsets,44],[events,2],[__consumer_offsets,32],[cdrrecords-errors,1],[__consumer
      _offsets,8],[__consumer_offsets,20],[cdrrecords,0],[__consumer_offsets,14],[__consumer_offsets,26] (kafka.server.ReplicaFetcherManager)
      [2016-04-18 06:05:58,556] INFO Truncating log __consumer_offsets-32 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:58,556] INFO Truncating log __consumer_offsets-44 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:58,556] INFO Truncating log cdrrecords-errors-1 to offset 8. (kafka.log.Log)
      [2016-04-18 06:05:58,556] INFO Truncating log cdrrecords-0 to offset 172892. (kafka.log.Log)
      [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-38 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-8 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:58,557] INFO Truncating log events-2 to offset 29954. (kafka.log.Log)
      [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-20 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-2 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-14 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-26 to offset 0. (kafka.log.Log)
      [2016-04-18 06:05:58,565] INFO [ReplicaFetcherManager on broker 5] Added fetcher for partitions List([[__consumer_offsets,32], initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,44], initOffset 0 to
      broker BrokerEndPoint(4,kafka303.prod.local,9092)] , [[cdrrecords-errors,1], initOffset 8 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , [[cdrrecords,0], initOffset 172892 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] ,
      [[__consumer_offsets,38], initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,8], initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , [[events,2], initOffset 29954 to broker BrokerE
      ndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,20], initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,2], initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consu
      mer_offsets,14], initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,26], initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] ) (kafka.server.ReplicaFetcherManager)
      [2016-04-18 06:05:58,568] INFO [ReplicaFetcherThread-0-6], Shutting down (kafka.server.ReplicaFetcherThread)
      [2016-04-18 06:05:58,613] INFO New leader is 4 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
      [2016-04-18 06:06:04,571] INFO [ReplicaFetcherThread-0-6], Stopped (kafka.server.ReplicaFetcherThread)
      [2016-04-18 06:06:04,571] INFO [ReplicaFetcherThread-0-6], Shutdown completed (kafka.server.ReplicaFetcherThread)

      Broker 6:

      [2016-04-18 05:57:12,699] INFO [Group Metadata Manager on Broker 6]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
      [2016-04-18 06:05:55,281] INFO Partition [__consumer_offsets,47] on broker 6: Shrinking ISR for partition [__consumer_offsets,47] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:03,656] INFO re-registering broker info in ZK for broker 6 (kafka.server.KafkaHealthcheck)
      [2016-04-18 06:06:03,691] INFO Creating /brokers/ids/6 (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
      [2016-04-18 06:06:03,696] INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
      [2016-04-18 06:06:03,696] INFO Partition [__consumer_offsets,47] on broker 6: Cached zkVersion [90] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:03,697] INFO Registered broker 6 at path /brokers/ids/6 with addresses: PLAINTEXT -> EndPoint(kafka305.prod.local,9092,PLAINTEXT) (kafka.utils.ZkUtils)
      [2016-04-18 06:06:03,698] INFO done re-registering broker (kafka.server.KafkaHealthcheck)
      [2016-04-18 06:06:03,698] INFO Partition [__consumer_offsets,26] on broker 6: Shrinking ISR for partition [__consumer_offsets,26] from 5,6,4 to 6,4 (kafka.cluster.Partition)
      [2016-04-18 06:06:03,699] INFO Subscribing to /brokers/topics path to watch for new topics (kafka.server.KafkaHealthcheck)
      [2016-04-18 06:06:03,709] INFO Partition [__consumer_offsets,26] on broker 6: Cached zkVersion [88] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:03,821] INFO New leader is 4 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
      [2016-04-18 06:06:12,390] INFO Partition [__consumer_offsets,47] on broker 6: Shrinking ISR for partition [__consumer_offsets,47] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:12,406] INFO Partition [__consumer_offsets,47] on broker 6: Cached zkVersion [90] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:12,410] INFO Partition [__consumer_offsets,41] on broker 6: Shrinking ISR for partition [__consumer_offsets,41] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:12,421] INFO Partition [__consumer_offsets,41] on broker 6: Cached zkVersion [80] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:12,422] INFO Partition [cdrrecords,0] on broker 6: Shrinking ISR for partition [cdrrecords,0] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:12,432] INFO Partition [cdrrecords,0] on broker 6: Cached zkVersion [170] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:12,432] INFO Partition [__consumer_offsets,20] on broker 6: Shrinking ISR for partition [__consumer_offsets,20] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:12,442] INFO Partition [__consumer_offsets,20] on broker 6: Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:12,442] INFO Partition [__consumer_offsets,32] on broker 6: Shrinking ISR for partition [__consumer_offsets,32] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:12,452] INFO Partition [__consumer_offsets,32] on broker 6: Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:12,452] INFO Partition [cdrrecords-errors,1] on broker 6: Shrinking ISR for partition [cdrrecords-errors,1] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:12,462] INFO Partition [cdrrecords-errors,1] on broker 6: Cached zkVersion [135] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:12,462] INFO Partition [__consumer_offsets,38] on broker 6: Shrinking ISR for partition [__consumer_offsets,38] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:12,503] INFO Partition [__consumer_offsets,38] on broker 6: Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:12,503] INFO Partition [__consumer_offsets,5] on broker 6: Shrinking ISR for partition [__consumer_offsets,5] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:12,515] INFO Partition [__consumer_offsets,5] on broker 6: Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:12,515] INFO Partition [__consumer_offsets,23] on broker 6: Shrinking ISR for partition [__consumer_offsets,23] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:12,526] INFO Partition [__consumer_offsets,23] on broker 6: Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:12,526] INFO Partition [events-errors,2] on broker 6: Shrinking ISR for partition [events-errors,2] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:12,536] INFO Partition [events-errors,2] on broker 6: Cached zkVersion [167] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:12,537] INFO Partition [events,2] on broker 6: Shrinking ISR for partition [events,2] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:12,546] INFO Partition [events,2] on broker 6: Cached zkVersion [237] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:12,547] INFO Partition [__consumer_offsets,8] on broker 6: Shrinking ISR for partition [__consumer_offsets,8] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:12,557] INFO Partition [__consumer_offsets,8] on broker 6: Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:12,557] INFO Partition [__consumer_offsets,44] on broker 6: Shrinking ISR for partition [__consumer_offsets,44] from 5,6,4 to 6 (kafka.cluster.Partition)
      [2016-04-18 06:06:12,567] INFO Partition [__consumer_offsets,44] on broker 6: Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
      [2016-04-18 06:06:12,567] INFO Partition [__consumer_offsets,29] on broker 6: Shrinking ISR for partition [__consumer_offsets,29] from 5,6,4 to 6 (kafka.cluster.Partition)

      and this goes on until the broker is restarted

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              kichristensen Kim Christensen
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: