Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-9044

Brokers occasionally (randomly?) dropping out of clusters

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Not A Bug
    • 2.3.0, 2.3.1
    • None
    • core
    • None
    • Ubuntu 14.04

    Description

      I have several cluster running kafka 2.3.1 and 2.3.0 and this issue has affected all of them. Because of replication and the size of the clusters (30 brokers), this bug is not causing any data loss, but it is nevertheless concerning. When a broker drops out, the log gives no indication that there are any zookeeper issues (and indeed the zookeepers are healthy when this occurs. Here's snippet from a broker log when it occurs:

      [2019-10-07 11:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, dir=/data/3/kl] Found deletable segments with base offsets [1975332] due to retention time 3600000ms breach (kafka.log.Log)
      [2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, dir=/data/3/kl] Scheduling log segment [baseOffset 1975332, size 92076008] for deletion. (kafka.log.Log)
      [2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, dir=/data/3/kl] Incrementing log start offset to 2000317 (kafka.log.Log)
      [2019-10-07 11:03:56,936] INFO [Log partition=internal_test-52, dir=/data/3/kl] Deleting segment 1975332 (kafka.log.Log)
      [2019-10-07 11:03:56,957] INFO Deleted log /data/3/kl/internal_test-52/00000000000001975332.log.deleted. (kafka.log.LogSegment)
      [2019-10-07 11:03:56,957] INFO Deleted offset index /data/3/kl/internal_test-52/00000000000001975332.index.deleted. (kafka.log.LogSegment)
      [2019-10-07 11:03:56,958] INFO Deleted time index /data/3/kl/internal_test-52/00000000000001975332.timeindex.deleted. (kafka.log.LogSegment)
      [2019-10-07 11:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 11:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 11:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 11:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 11:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 12:02:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 12:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 12:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 12:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 12:42:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 1 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 12:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 13:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 13:12:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 13:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 13:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 13:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 13:52:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 1 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 14:02:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 14:12:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 14:22:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 1 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 14:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 14:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-10-07 14:46:07,510] INFO [Partition internal_test-33 broker=14] Shrinking ISR from 16,17,14 to 14. Leader: (highWatermark: 2007553, endOffset: 2007555). Out of sync replicas: (brokerId: 16, endOffset: 2007553) (brokerId: 17, endOffset: 2007553). (kafka.cluster.Partition)
      [2019-10-07 14:46:07,511] INFO [Partition internal_test-33 broker=14] Cached zkVersion [17] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

      The controller log shows the following:

      [2019-10-07 14:45:55,427] INFO [Controller id=24] Newly added brokers: , deleted brokers: 14, bounced brokers: , all live brokers: 1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25 (kafka.controller.KafkaController)
      [2019-10-07 14:45:55,477] INFO [RequestSendThread controllerId=24] Shutting down (kafka.controller.RequestSendThread)
      [2019-10-07 14:45:55,477] INFO [RequestSendThread controllerId=24] Shutdown completed (kafka.controller.RequestSendThread)
      [2019-10-07 14:45:55,477] INFO [RequestSendThread controllerId=24] Stopped (kafka.controller.RequestSendThread)
      [2019-10-07 14:45:55,481] INFO [Controller id=24] Broker failure callback for 14 (kafka.controller.KafkaController)

      The clusters use zookeeper.session.timeout.ms=45000, and zookeeper.connection.timeout.ms=90000.

      I'm unable to find a cause for this behavior. The only thing I can do to resolve the issue is to restart the broker.

      Attachments

        Activity

          People

            Unassigned Unassigned
            pmbuko Peter Bukowinski
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: