KAFKA-9824

Consumer loses partition offset and resets post 2.4.1 version upgrade


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.1
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      Hello,
      Around two weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 2.3.1),
      and we started noticing a troubling behavior that we hadn't seen before:
       
      Without any apparent reason, a specific partition on a specific consumer loses its offset and starts re-consuming the entire partition from the beginning (i.e. from the earliest offset still within retention).
       
      Messages appearing on the consumer (client):

      Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Resetting offset for partition trackingSolutionAttribution-48 to offset 1216430527.

      Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of range for partition trackingSolutionAttribution-48

      These are the broker logs from the same time (searched for "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4"):

      Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
       
      Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
       
      Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 in 0 ms.
       
      Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups
       
      Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups

      Another way to see the same thing is from our monitoring (DD) of the partition offset:

      The recovery you see there happens after I manually reset the partition offset (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 --reset-offsets --to-datetime 'SOME DATE').
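      For completeness, the same reset can also be done from code. Below is a minimal sketch of the equivalent logic using the consumer API; the bootstrap address and the target datetime are placeholders, and the group has to be inactive while the commit runs, same as with the CLI tool:

      import java.time.Instant;
      import java.util.Collections;
      import java.util.Map;
      import java.util.Properties;
      import org.apache.kafka.clients.consumer.*;
      import org.apache.kafka.common.TopicPartition;
      import org.apache.kafka.common.serialization.ByteArrayDeserializer;

      public class ResetPartitionOffset {
          public static void main(String[] args) {
              Properties props = new Properties();
              props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
              props.put(ConsumerConfig.GROUP_ID_CONFIG, "fireAttributionConsumerGroup4");
              props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
              props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
              props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);

              TopicPartition tp = new TopicPartition("trackingSolutionAttribution", 57);
              long targetTime = Instant.parse("2020-04-05T00:00:00Z").toEpochMilli(); // placeholder datetime

              try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                  consumer.assign(Collections.singletonList(tp));
                  // Find the earliest offset whose timestamp is >= targetTime
                  Map<TopicPartition, OffsetAndTimestamp> byTime =
                          consumer.offsetsForTimes(Collections.singletonMap(tp, targetTime));
                  OffsetAndTimestamp oat = byTime.get(tp);
                  if (oat != null) {
                      // Commit that offset for the group; same effect as --reset-offsets --to-datetime --execute
                      consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(oat.offset())));
                  }
              }
          }
      }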
       
      Any idea what could be causing this? It has happened to us at least 5 times since the upgrade, and I don't remember it ever happening before that.
       
      The topic config is left at the defaults, except for the retention, which is manually set to 43200000 ms (12 hours).
      The topic has 60 partitions & a replication factor of 2.
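      A quick way to double-check that override is to read the topic's retention.ms back through the AdminClient; a minimal sketch, with the bootstrap address assumed:

      import java.util.Collections;
      import java.util.Properties;
      import org.apache.kafka.clients.admin.AdminClient;
      import org.apache.kafka.clients.admin.AdminClientConfig;
      import org.apache.kafka.clients.admin.Config;
      import org.apache.kafka.common.config.ConfigResource;

      public class ShowTopicRetention {
          public static void main(String[] args) throws Exception {
              Properties props = new Properties();
              props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
              try (AdminClient admin = AdminClient.create(props)) {
                  ConfigResource topic =
                          new ConfigResource(ConfigResource.Type.TOPIC, "trackingSolutionAttribution");
                  Config config = admin.describeConfigs(Collections.singleton(topic))
                                       .all().get().get(topic);
                  // Expected: 43200000 (12 hours), the only non-default topic setting
                  System.out.println("retention.ms = " + config.get("retention.ms").value());
              }
          }
      }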
       
      Consumer config:

      ConsumerConfig values:
      	allow.auto.create.topics = true
      	auto.commit.interval.ms = 5000
      	auto.offset.reset = earliest
      	bootstrap.servers = [..........]
      	check.crcs = true
      	client.dns.lookup = default
      	client.id =
      	client.rack =
      	connections.max.idle.ms = 540000
      	default.api.timeout.ms = 60000
      	enable.auto.commit = true
      	exclude.internal.topics = true
      	fetch.max.bytes = 52428800
      	fetch.max.wait.ms = 500
      	fetch.min.bytes = 1
      	group.id = fireAttributionConsumerGroup4
      	group.instance.id = null
      	heartbeat.interval.ms = 10000
      	interceptor.classes = []
      	internal.leave.group.on.close = true
      	isolation.level = read_uncommitted
      	key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
      	max.partition.fetch.bytes = 1048576
      	max.poll.interval.ms = 300000
      	max.poll.records = 500
      	metadata.max.age.ms = 300000
      	metric.reporters = []
      	metrics.num.samples = 2
      	metrics.recording.level = INFO
      	metrics.sample.window.ms = 30000
      	partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
      	receive.buffer.bytes = 65536
      	reconnect.backoff.max.ms = 1000
      	reconnect.backoff.ms = 50
      	request.timeout.ms = 30000
      	retry.backoff.ms = 100
      	sasl.client.callback.handler.class = null
      	sasl.jaas.config = null
      	sasl.kerberos.kinit.cmd = /usr/bin/kinit
      	sasl.kerberos.min.time.before.relogin = 60000
      	sasl.kerberos.service.name = null
      	sasl.kerberos.ticket.renew.jitter = 0.05
      	sasl.kerberos.ticket.renew.window.factor = 0.8
      	sasl.login.callback.handler.class = null
      	sasl.login.class = null
      	sasl.login.refresh.buffer.seconds = 300
      	sasl.login.refresh.min.period.seconds = 60
      	sasl.login.refresh.window.factor = 0.8
      	sasl.login.refresh.window.jitter = 0.05
      	sasl.mechanism = GSSAPI
      	security.protocol = PLAINTEXT
      	security.providers = null
      	send.buffer.bytes = 131072
      	session.timeout.ms = 30000
      	ssl.cipher.suites = null
      	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
      	ssl.endpoint.identification.algorithm = https
      	ssl.key.password = null
      	ssl.keymanager.algorithm = SunX509
      	ssl.keystore.location = null
      	ssl.keystore.password = null
      	ssl.keystore.type = JKS
      	ssl.protocol = TLS
      	ssl.provider = null
      	ssl.secure.random.implementation = null
      	ssl.trustmanager.algorithm = PKIX
      	ssl.truststore.location = null
      	ssl.truststore.password = null
      	ssl.truststore.type = JKS
      	value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
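
      For reference, a minimal sketch of how a consumer with the above configuration is built (the broker list and record processing are assumed); auto.offset.reset=earliest and enable.auto.commit=true are the two settings most relevant to the behavior described above:

      import java.time.Duration;
      import java.util.Collections;
      import java.util.Properties;
      import org.apache.kafka.clients.consumer.ConsumerConfig;
      import org.apache.kafka.clients.consumer.ConsumerRecord;
      import org.apache.kafka.clients.consumer.ConsumerRecords;
      import org.apache.kafka.clients.consumer.KafkaConsumer;
      import org.apache.kafka.common.serialization.ByteArrayDeserializer;
      import org.apache.kafka.common.serialization.StringDeserializer;

      public class FireAttributionConsumer {
          public static void main(String[] args) {
              Properties props = new Properties();
              props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
              props.put(ConsumerConfig.GROUP_ID_CONFIG, "fireAttributionConsumerGroup4");
              props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
              props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
              // When the fetch offset is out of range, the consumer falls back to the log
              // start offset, which produces the "Resetting offset ..." line shown above.
              props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
              // Offsets are committed in the background every 5 seconds.
              props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
              props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "5000");

              try (KafkaConsumer<byte[], String> consumer = new KafkaConsumer<>(props)) {
                  consumer.subscribe(Collections.singletonList("trackingSolutionAttribution"));
                  while (true) {
                      ConsumerRecords<byte[], String> records = consumer.poll(Duration.ofMillis(500));
                      for (ConsumerRecord<byte[], String> record : records) {
                          // processing assumed
                      }
                  }
              }
          }
      }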
      

       
      Broker config (server.properties) - We have 9 brokers:

      num.io.threads=4
      num.network.threads=64
      socket.send.buffer.bytes=33554432
      socket.receive.buffer.bytes=33554432
      socket.request.max.bytes=104857600
      num.partitions=2
      log.retention.hours=1
      log.segment.bytes=536870912
      log.retention.check.interval.ms=60000
      zookeeper.connection.timeout.ms=1000000
      controlled.shutdown.enable=true
      auto.leader.rebalance.enable=true
      log.cleaner.enable=true
      log.cleaner.min.cleanable.ratio=0.1
      log.cleaner.threads=2
      log.cleanup.policy=delete
      log.cleaner.delete.retention.ms=86400000
      log.cleaner.io.max.bytes.per.second=1.7976931348623157E308
      num.recovery.threads.per.data.dir=1
      log.flush.interval.messages=9223372036854775807
      message.max.bytes=10000000
      replica.fetch.max.bytes=10000000
      default.replication.factor=2
      delete.topic.enable=true
      unclean.leader.election.enable=false
      compression.type=snappy
      inter.broker.protocol.version=2.4
      

       
      Let me know if you need more data from me.
       
      Thanks

      People

        Assignee: Unassigned
        Reporter: Nitay Kufert (nitay.k)
