Details
Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 1.1.0
Fix Version/s: None
Our setup details are as follows:
Confluent Kafka image: confluentinc/cp-enterprise-kafka:4.1.0
In the testing setup, we are using a single-broker setup deployed in a K8S cluster.
We newly deployed our application, including the broker, in the K8S cluster and observed the following issue for the first time, which caused our applications to fail to come up.
Description
1. Most of the consumers got stuck while reading data from the Kafka topic; the stack trace of the stuck thread is given below. After a certain timeout the application restarted and tried to connect with the same consumer group, but it got stuck at the same stack again.
"main" #1 prio=5 os_prio=0 tid=0x0000000001811800 nid=0x194 runnable [0x00007ffe513bd000]
java.lang.Thread.State: RUNNABLE
at org.apache.kafka.common.network.PlaintextTransportLayer.read(PlaintextTransportLayer.java:104)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:122)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:235)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:196)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:557)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:495)
at org.apache.kafka.common.network.Selector.poll(Selector.java:424)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:460)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:261)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:156)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:228)
- locked <0x00000000ae7acf08> (a org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:205)
- locked <0x00000000ae7acf08> (a org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.fetchCommittedOffsets(ConsumerCoordinator.java:465)
at org.apache.kafka.clients.consumer.KafkaConsumer.committed(KafkaConsumer.java:1461)
2. To debug further, we installed kafkacat and tried to consume the data, first using the same consumer group that was getting stuck and then using a new consumer group. With the stuck consumer group we were not able to consume data; with the new consumer group we were. The errors seen for the stuck consumer group are as follows:
%7|1528304675.172|COMMIT|rdkafka#consumer-1| OffsetCommit for -1 partition(s) returned: Local: No offset stored
%7|1528304675.172|UNASSIGN|rdkafka#consumer-1| Group "agent.defaultagent": unassign done in state wait-broker (join state init): without new assignment: OffsetCommit done (__NO_OFFSET)
%7|1528304675.223|CGRPQUERY|rdkafka#consumer-1| broker:9092/bootstrap: Group "agent.defaultagent": querying for coordinator: intervaled in state wait-broker
%7|1528304675.244|SEND|rdkafka#consumer-1| broker:9092/bootstrap: Sent GroupCoordinatorRequest (v0, 41 bytes @ 0, CorrId 25)
%7|1528304675.255|RECV|rdkafka#consumer-1| broker:9092/bootstrap: Received GroupCoordinatorResponse (v0, 12 bytes, CorrId 25, rtt 10.91ms)
%7|1528304675.326|CGRPCOORD|rdkafka#consumer-1| broker:9092/bootstrap: Group "agent.defaultagent" GroupCoordinator response error: Broker: Group coordinator not available
%7|1528304676.226|CGRPQUERY|rdkafka#consumer-1| broker-0.broker.default.svc.cluster.local:9092/0: Group "agent.defaultagent": querying for coordinator: intervaled in state wait-broker
%7|1528304676.330|SEND|rdkafka#consumer-1| broker-0.broker.default.svc.cluster.local:9092/0: Sent GroupCoordinatorRequest (v0, 41 bytes @ 0, CorrId 33)
%7|1528304676.350|RECV|rdkafka#consumer-1| broker-0.broker.default.svc.cluster.local:9092/0: Received GroupCoordinatorResponse (v0, 12 bytes, CorrId 33, rtt 19.93ms)
%7|1528304676.430|CGRPCOORD|rdkafka#consumer-1| broker-0.broker.default.svc.cluster.local:9092/0: Group "agent.defaultagent" GroupCoordinator response error: Broker: Group coordinator not available
%7|1528304677.226|CGRPQUERY|rdkafka#consumer-1| broker:9092/bootstrap: Group "agent.defaultagent": querying for coordinator: intervaled in state wait-broker
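The debug output above can be condensed to see which brokers answered the coordinator lookup with an error. The following is only a minimal sketch: it assumes each line has the shape `%LEVEL|TIMESTAMP|FACILITY|CLIENT| MESSAGE` as seen in the log above, and the names `DebugLine`, `parse_debug_line`, and `coordinator_errors` are hypothetical helpers, not part of kafkacat or librdkafka.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class DebugLine(NamedTuple):
    level: int
    timestamp: float
    facility: str
    client: str
    message: str


def parse_debug_line(line: str) -> Optional[DebugLine]:
    """Split one '%LEVEL|TIMESTAMP|FACILITY|CLIENT| MESSAGE' line.

    Returns None when the line does not have that shape.
    """
    parts = line.strip().lstrip("%").split("|", 4)
    if len(parts) != 5:
        return None
    level, timestamp, facility, client, message = parts
    try:
        return DebugLine(int(level), float(timestamp), facility, client, message.strip())
    except ValueError:
        return None


def coordinator_errors(lines: Iterable[str]) -> Counter:
    """Count CGRPCOORD 'response error' messages per broker."""
    errors: Counter = Counter()
    for raw in lines:
        parsed = parse_debug_line(raw)
        if parsed and parsed.facility == "CGRPCOORD" and "response error" in parsed.message:
            # The broker name is the text before the first ': ' in the message.
            broker = parsed.message.split(": ", 1)[0]
            errors[broker] += 1
    return errors


# Three lines copied from the log above.
SAMPLE = [
    '%7|1528304675.255|RECV|rdkafka#consumer-1| broker:9092/bootstrap: Received GroupCoordinatorResponse (v0, 12 bytes, CorrId 25, rtt 10.91ms)',
    '%7|1528304675.326|CGRPCOORD|rdkafka#consumer-1| broker:9092/bootstrap: Group "agent.defaultagent" GroupCoordinator response error: Broker: Group coordinator not available',
    '%7|1528304676.430|CGRPCOORD|rdkafka#consumer-1| broker-0.broker.default.svc.cluster.local:9092/0: Group "agent.defaultagent" GroupCoordinator response error: Broker: Group coordinator not available',
]

if __name__ == "__main__":
    for broker, count in coordinator_errors(SAMPLE).items():
        print(broker, count)
```

In the excerpt above, both the bootstrap connection and the direct connection to broker 0 return the same coordinator error, which suggests the problem lies on the broker side rather than with one particular connection.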
3. We tried to delete the stuck consumer group; however, the deletion fails with the same coordinator-not-available error:
Error: Deletion of some consumer groups failed:
- Group 'agent.defaultagent' could not be deleted due to: COORDINATOR_NOT_AVAILABLE
4. From the link http://home.apache.org/~ewencp/kafka-0.10.2.0-rc1/javadoc/org/apache/kafka/common/errors/GroupCoordinatorNotAvailableException.html I understand that this is a temporary issue that should resolve once the offsets topic is created. In our case, however, it has not recovered, even though consumption from the same topic with a different consumer group works.
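One thing that may help narrow this down: the broker chooses a group's coordinator from the `__consumer_offsets` partition that the group id hashes to, so two groups can depend on different partitions of that topic. The sketch below reproduces that mapping under stated assumptions: it mirrors Java's `String.hashCode()` and the non-negative-modulo step, and it assumes the default of 50 partitions for `offsets.topic.num.partitions` (the actual value in our cluster may differ). The function names are hypothetical, and nothing in the logs above confirms that a single partition is the cause.

```python
def java_string_hash(s: str) -> int:
    """Replicate java.lang.String.hashCode(): signed 32-bit, over UTF-16 code units."""
    data = s.encode("utf-16-be")
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF
    # Convert the unsigned 32-bit value to Java's signed int range.
    return h - (1 << 32) if h >= (1 << 31) else h


def offsets_partition_for(group_id: str, num_partitions: int = 50) -> int:
    """Partition of __consumer_offsets that a group id maps to.

    num_partitions is an assumption: 50 is the default for
    offsets.topic.num.partitions, but a cluster may override it.
    """
    h = java_string_hash(group_id)
    # Integer.MIN_VALUE has no positive counterpart, so it is mapped to 0.
    non_negative = 0 if h == -(1 << 31) else abs(h)
    return non_negative % num_partitions


if __name__ == "__main__":
    # "some.new.group" is a placeholder for whichever new group was able to consume.
    for group in ("agent.defaultagent", "some.new.group"):
        print(group, offsets_partition_for(group))
```

If the stuck group and a working group map to different partitions, comparing the leader and ISR of those two partitions (for example with the `--describe` option of the `kafka-topics` tool on `__consumer_offsets`) may show whether only one partition is affected.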
Can you let me know how to recover the system without restarting the broker or ZooKeeper? How can this race condition be avoided? Also, is this a bug in Kafka?
Let me know if any other details are required.