Kafka / KAFKA-17252

Forwarded source task zombie fencings may fail when leader has just started


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: connect
    • Labels: None

    Description

      We've observed some flaky integration test failures such as this one where a source task with exactly-once support enabled fails to start, with this stack trace:

      org.apache.kafka.connect.runtime.rest.errors.ConnectRestException: This worker is still starting up and has not been able to read a session key from the config topic yet
              at org.apache.kafka.connect.runtime.rest.RestClient.httpRequest(RestClient.java:186)
              at org.apache.kafka.connect.runtime.rest.RestClient.httpRequest(RestClient.java:140)
              at org.apache.kafka.connect.runtime.rest.RestClient.httpRequest(RestClient.java:101)
              at org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$fenceZombieSourceTasks$23(DistributedHerder.java:1329)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
              at java.base/java.lang.Thread.run(Thread.java:1583) 

      This occurs because the leader has not yet read (or believes it has not yet read) a session key from the config topic.
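
      For illustration, here is a minimal sketch of the kind of leader-side guard that produces this 503, assuming a simplified herder; the class, field, and method names are hypothetical and do not match DistributedHerder's actual internals:

      import java.util.concurrent.atomic.AtomicReference;
      import javax.crypto.SecretKey;

      public class LeaderGuardSketch {
          // Leader's current session key; stays null until one is picked up
          // from the config topic (hypothetical stand-in for herder state).
          private final AtomicReference<SecretKey> sessionKey = new AtomicReference<>();

          // Simplified stand-in for leader-side handling of a zombie-fencing
          // request forwarded from a follower over the internal REST API.
          void handleForwardedFenceRequest(String connName) {
              if (sessionKey.get() == null) {
                  // The condition behind the 503 in the stack trace above:
                  // without a key, the request signature cannot be verified.
                  throw new IllegalStateException(
                          "503: This worker is still starting up and has not been "
                          + "able to read a session key from the config topic yet");
              }
              // ... verify the request signature with the key, then fence ...
          }
      }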

      However, in a cluster where all nodes have always used the sessioned rebalance protocol, this scenario should be impossible: there must be a session key present in the topic in order for a leader to handle external requests (such as creating connectors), and all workers must read to the end of all internal topics before joining the cluster.

      The cause of this failure is that, during startup, session keys read from the config topic are ignored. The herder does check its config state snapshot for a session key in its tick thread loop, which should help, but it's possible for a worker to join the cluster, become the leader, and receive a request from a follower to fence a zombie source task before this check occurs. In that case the leader responds with a 503 error, failing the task.
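
      The window can be pictured with this simplified sketch of the race, under the assumptions above; the fields and methods here are likewise hypothetical, not the herder's actual code:

      import java.util.concurrent.atomic.AtomicReference;

      public class StartupRaceSketch {
          // Key the herder actually uses to validate forwarded requests.
          private final AtomicReference<byte[]> sessionKey = new AtomicReference<>();
          // Latest session key recorded in the config state snapshot.
          private volatile byte[] snapshotSessionKey;

          // (1) Startup: records are read from the config topic, but session
          //     keys among them only land in the snapshot; sessionKey stays null.
          void onSessionKeyRecordDuringStartup(byte[] key) {
              snapshotSessionKey = key;
          }

          // (2) The tick loop eventually promotes the snapshot's key...
          void tick() {
              if (sessionKey.get() == null && snapshotSessionKey != null) {
                  sessionKey.set(snapshotSessionKey);
              }
          }

          // (3) ...but a forwarded fencing request can arrive after this worker
          //     has joined and become leader, yet before tick() has run, so the
          //     guard still sees no key and the leader answers with a 503.
          boolean canValidateForwardedRequest() {
              return sessionKey.get() != null;
          }
      }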


People

    Assignee: Chris Egerton
    Reporter: Chris Egerton
