Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.4.1
Fix Version/s: None
Component/s: None
Labels: None
Description
Hello,
around two weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 2.3.1), and we started noticing troubling behavior that we had not seen before:
For no apparent reason, a specific partition on a specific consumer loses its offset and starts re-consuming the entire partition from the beginning (i.e. from the earliest offset still within the retention window).
Messages appearing on the consumer (client); note that the fetch offset is rejected as out of range, and since we run with auto.offset.reset = earliest the client then falls back to the log start offset:
Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Resetting offset for partition trackingSolutionAttribution-48 to offset 1216430527.
Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of range for partition trackingSolutionAttribution-48
These are the logs from the brokers at the same time (searched for "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4"):
Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 in 0 ms.
Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups
Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups
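Worth noting: the offset reset on the client lands in the same second as the "Rolled new log segment" entries above. A rough way to check whether every occurrence lines up with a segment roll on the leader (the log path below is a placeholder for wherever server.log lives on your brokers):
grep -h "Rolled new log segment" /path/to/kafka/logs/server.log* | sort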
Another way to see the same thing, from our monitoring (Datadog) of the partition offset:
The recovery you see there is from after I reset the partition offset manually (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 --reset-offsets --to-datetime 'SOME DATE').
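One note for anyone reproducing the workaround: as far as I know, --reset-offsets only performs a dry run by default, so the actual reset needs --execute appended:
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 --reset-offsets --to-datetime 'SOME DATE' --execute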
Any idea what could be causing this? It has happened to us at least 5 times since the upgrade, and before that I don't remember it ever happening.
Topic config is left at the defaults, except for the retention, which is manually set to 43200000 ms (12 hours).
The topic has 60 partitions & a replication factor of 2.
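For reference, both of these can be double-checked from the CLI (the bootstrap address below is a placeholder):
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic trackingSolutionAttribution
shows the partition count, replication factor and the retention.ms override, and
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group fireAttributionConsumerGroup4
shows each partition's CURRENT-OFFSET next to its LOG-END-OFFSET, which makes a rewound partition easy to spot before the lag graphs catch up.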
Consumer config:
ConsumerConfig values:
	allow.auto.create.topics = true
	auto.commit.interval.ms = 5000
	auto.offset.reset = earliest
	bootstrap.servers = [..........]
	check.crcs = true
	client.dns.lookup = default
	client.id =
	client.rack =
	connections.max.idle.ms = 540000
	default.api.timeout.ms = 60000
	enable.auto.commit = true
	exclude.internal.topics = true
	fetch.max.bytes = 52428800
	fetch.max.wait.ms = 500
	fetch.min.bytes = 1
	group.id = fireAttributionConsumerGroup4
	group.instance.id = null
	heartbeat.interval.ms = 10000
	interceptor.classes = []
	internal.leave.group.on.close = true
	isolation.level = read_uncommitted
	key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
	max.partition.fetch.bytes = 1048576
	max.poll.interval.ms = 300000
	max.poll.records = 500
	metadata.max.age.ms = 300000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
	receive.buffer.bytes = 65536
	reconnect.backoff.max.ms = 1000
	reconnect.backoff.ms = 50
	request.timeout.ms = 30000
	retry.backoff.ms = 100
	sasl.client.callback.handler.class = null
	sasl.jaas.config = null
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.login.callback.handler.class = null
	sasl.login.class = null
	sasl.login.refresh.buffer.seconds = 300
	sasl.login.refresh.min.period.seconds = 60
	sasl.login.refresh.window.factor = 0.8
	sasl.login.refresh.window.jitter = 0.05
	sasl.mechanism = GSSAPI
	security.protocol = PLAINTEXT
	security.providers = null
	send.buffer.bytes = 131072
	session.timeout.ms = 30000
	ssl.cipher.suites = null
	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
	ssl.endpoint.identification.algorithm = https
	ssl.key.password = null
	ssl.keymanager.algorithm = SunX509
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLS
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
	value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
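A guard we are considering on our side (an idea, not something already deployed): with auto.offset.reset = earliest an out-of-range fetch silently rewinds the partition, whereas
auto.offset.reset = none
makes poll() throw OffsetOutOfRangeException instead, so the application can alert and halt rather than replay up to 12 hours of data.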
Broker config (server.properties) - we have 9 brokers:
num.io.threads=4
num.network.threads=64
socket.send.buffer.bytes=33554432
socket.receive.buffer.bytes=33554432
socket.request.max.bytes=104857600
num.partitions=2
log.retention.hours=1
log.segment.bytes=536870912
log.retention.check.interval.ms=60000
zookeeper.connection.timeout.ms=1000000
controlled.shutdown.enable=true
auto.leader.rebalance.enable=true
log.cleaner.enable=true
log.cleaner.min.cleanable.ratio=0.1
log.cleaner.threads=2
log.cleanup.policy=delete
log.cleaner.delete.retention.ms=86400000
log.cleaner.io.max.bytes.per.second=1.7976931348623157E308
num.recovery.threads.per.data.dir=1
log.flush.interval.messages=9223372036854775807
message.max.bytes=10000000
replica.fetch.max.bytes=10000000
default.replication.factor=2
delete.topic.enable=true
unclean.leader.election.enable=false
compression.type=snappy
inter.broker.protocol.version=2.4
Let me know if you need more data from me.
Thanks
Issue Links
- duplicates KAFKA-9543: Consumer offset reset after new segment rolling (Resolved)