Description
When Kafka Connect runs in distributed mode and a sink connector's task is restarted (via a worker's REST API), the following sequence of steps occurs on the DistributedHerder's thread:
- The existing sink task will be stopped (ref)
- A new sink task will be started (ref)
- As a part of the above step, a new WorkerSinkTask will be instantiated (ref)
- The DLQ reporter (see KIP-298) for the sink task is also instantiated and configured as a part of this (ref)
- The DLQ reporter setup involves two synchronous admin client calls to list topics and create the DLQ topic if it isn't already created (ref)
All of these steps occur synchronously on the herder's tick thread, in the portion of the tick where external requests are run (ref). If an admin client call in the DLQ reporter setup step blocks for some time (for example, due to auth failures and retries, or network issues), the Connect worker can become non-functional: REST API requests will time out. The worker can even fall out of the Connect cluster and become a zombie, since the tick thread also drives group membership functions (ref, ref).
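The effect can be illustrated with a minimal, self-contained sketch that uses only `java.util.concurrent` and no Kafka classes. Everything in it is hypothetical: `TickThreadSketch`, `slowDlqSetup`, and `heartbeatDelayMillis` are illustrative names, the single-threaded `tick` executor stands in for the herder's tick thread, and the sleep stands in for the blocking admin client calls. The `offload` branch shows one possible mitigation (moving the slow setup to another thread); it is not a description of how Connect actually addresses the problem.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TickThreadSketch {
    // Hypothetical stand-in for the blocking admin client calls made during DLQ reporter setup.
    static void slowDlqSetup(long millis) throws InterruptedException {
        Thread.sleep(millis);
    }

    // Returns how long (in ms) a "heartbeat" queued behind a task restart waits before it runs.
    static long heartbeatDelayMillis(long setupMillis, boolean offload) throws Exception {
        ExecutorService tick = Executors.newSingleThreadExecutor();    // models the herder tick thread
        ExecutorService startup = Executors.newSingleThreadExecutor(); // hypothetical separate startup thread
        try {
            long start = System.nanoTime();
            // The restart request: either blocks the tick thread or hands the slow part off.
            tick.submit(() -> {
                if (offload) {
                    startup.submit(() -> { slowDlqSetup(setupMillis); return null; });
                } else {
                    slowDlqSetup(setupMillis);
                }
                return null;
            });
            // Group membership work queued on the same thread, behind the restart request.
            Future<Long> heartbeat = tick.submit(() -> (System.nanoTime() - start) / 1_000_000);
            return heartbeat.get();
        } finally {
            tick.shutdownNow();
            startup.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("inline setup:    heartbeat delayed " + heartbeatDelayMillis(500, false) + " ms");
        System.out.println("offloaded setup: heartbeat delayed " + heartbeatDelayMillis(500, true) + " ms");
    }
}
```

With the setup running inline, the heartbeat cannot start until the simulated admin calls return. If that delay exceeded the group session timeout in a real worker, the worker would be removed from the group, which is the zombie scenario described above.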
Issue Links
- relates to KAFKA-9374 Worker can be disabled by blocked connectors (Resolved)