[KAFKA-15238] Connect workers can be disabled by DLQ-related blocking admin client calls - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.6.0, 3.5.2
Component/s: connect
Labels: None

Description

When Kafka Connect runs in distributed mode and a sink connector's task is restarted (via a worker's REST API), the following sequence of steps occurs on the DistributedHerder's thread:

1. The existing sink task is stopped (ref)
2. A new sink task is started (ref)
3. As part of the above step, a new WorkerSinkTask is instantiated (ref)
4. The DLQ reporter (see KIP-298) for the sink task is also instantiated and configured as part of this (ref)
5. The DLQ reporter setup involves two synchronous admin client calls: one to list topics and one to create the DLQ topic if it doesn't already exist (ref)
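
To make the last step concrete, here is a minimal sketch of a "list topics, then create if missing" setup. It is an illustration only: `TopicAdmin`, `InMemoryAdmin`, and `ensureDlqTopic` are hypothetical names invented for this sketch, not Kafka's admin client API or Connect's actual code. The point is that both calls are synchronous, so whichever thread runs the setup waits for each one to return.

```java
import java.util.HashSet;
import java.util.Set;

class DlqSetupSketch {

    /** Hypothetical stand-in for an admin client; NOT Kafka's real Admin API. */
    interface TopicAdmin {
        Set<String> listTopics();      // synchronous: caller waits for the answer
        void createTopic(String name); // synchronous: caller waits for the answer
    }

    /** In-memory fake so the sketch runs without a broker. */
    static final class InMemoryAdmin implements TopicAdmin {
        private final Set<String> topics = new HashSet<>();
        int calls = 0; // counts how many admin calls were made

        @Override
        public Set<String> listTopics() {
            calls++;
            return new HashSet<>(topics);
        }

        @Override
        public void createTopic(String name) {
            calls++;
            topics.add(name);
        }
    }

    /** Returns true if the DLQ topic had to be created, false if it already existed. */
    static boolean ensureDlqTopic(TopicAdmin admin, String dlqTopic) {
        if (admin.listTopics().contains(dlqTopic)) {
            return false;
        }
        admin.createTopic(dlqTopic);
        return true;
    }

    public static void main(String[] args) {
        InMemoryAdmin admin = new InMemoryAdmin();
        System.out.println("created on first setup: " + ensureDlqTopic(admin, "my-dlq"));
        System.out.println("created on second setup: " + ensureDlqTopic(admin, "my-dlq"));
        System.out.println("admin calls made: " + admin.calls);
    }
}
```

With a real broker, either call could stall on retries or network problems, and the calling thread would stall with it.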

All of these steps occur synchronously on the herder's tick thread, in the portion where external requests are run. If an admin client call in the DLQ reporter setup step blocks for some time (because of auth failures and retries, network issues, or any other reason), the Connect worker can become non-functional (REST API requests will time out) and can even fall out of the Connect cluster and become a zombie, since the tick thread also drives group membership functions (see here, here).
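
The failure mode can be reproduced in miniature with plain `java.util.concurrent`, without any Kafka code. In the sketch below, a single-threaded executor stands in for the herder's tick thread, and a latch that is never released during the test stands in for a blocked admin client call. All class and method names are assumptions made for this illustration; this is not the DistributedHerder implementation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TickThreadStarvationSketch {

    /**
     * Returns true if a request queued behind a blocked setup call
     * does not complete within requestTimeoutMs.
     */
    static boolean requestTimesOutBehindBlockedCall(long requestTimeoutMs) throws Exception {
        // One thread handles everything, like the herder's tick thread.
        ExecutorService tickThread = Executors.newSingleThreadExecutor();
        CountDownLatch adminCallReturns = new CountDownLatch(1);
        try {
            // Stand-in for a task restart whose DLQ setup blocks on an admin call.
            tickThread.submit(() -> {
                try {
                    adminCallReturns.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // Stand-in for an unrelated REST request queued on the same thread.
            Future<String> restRequest = tickThread.submit(() -> "connector status");
            try {
                restRequest.get(requestTimeoutMs, TimeUnit.MILLISECONDS);
                return false; // the request was served in time
            } catch (TimeoutException e) {
                return true; // the request starved behind the blocked call
            }
        } finally {
            adminCallReturns.countDown(); // release the simulated admin call
            tickThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Prints "REST request timed out: true"
        System.out.println("REST request timed out: " + requestTimesOutBehindBlockedCall(200));
    }
}
```

The same starvation applies to anything else scheduled on that thread, which is why a blocked setup call can also interfere with group membership.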

Issue Links

relates to: KAFKA-9374 Worker can be disabled by blocked connectors (Resolved)
links to: GitHub Pull Request #14079

People

Assignee: Yash Mayya
Reporter: Yash Mayya
Votes: 0
Watchers: 1

Dates

Created: 24/Jul/23 06:52
Updated: 25/Jul/23 13:17
Resolved: 25/Jul/23 13:17