Description
When source tasks request source offsets from the framework, this results in a call to Future.get() with no timeout. In distributed workers, that future completes only after a successful read to the end of the source offsets topic, and that read polls the topic indefinitely until the latest messages for every partition have been consumed.
This normally completes in a reasonable amount of time. However, if connectivity between the Connect worker and the Kafka cluster is degraded or dropped in the middle of one of these reads, the read (and therefore the task's call to Future.get()) will block until connectivity is restored and the request completes successfully.
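To illustrate the difference between an unbounded and a bounded wait, here is a minimal sketch using only java.util.concurrent. It is not the framework's code: stalledOffsetRead and readWithTimeout are hypothetical names, and the never-completing CompletableFuture merely stands in for an offset read that is stuck because the cluster is unreachable.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class OffsetReadSketch {
    // Hypothetical stand-in for an offset read that never completes,
    // e.g. because the worker has lost connectivity to the cluster.
    static Future<String> stalledOffsetRead() {
        return new CompletableFuture<>();
    }

    // Bounded wait: gives up after timeoutMs instead of blocking forever.
    // Calling read.get() with no arguments here would never return for a stalled read.
    static Optional<String> readWithTimeout(Future<String> read, long timeoutMs)
            throws InterruptedException, ExecutionException {
        try {
            return Optional.ofNullable(read.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            read.cancel(true); // abandon the pending read
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws Exception {
        Optional<String> offsets = readWithTimeout(stalledOffsetRead(), 100);
        System.out.println(offsets.isPresent() ? offsets.get() : "timed out");
    }
}
```

Whether a timeout, cancellation on stop, or some other mechanism is the right fix for Connect is a separate question; the sketch only shows why the no-argument get() has no way out.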
If a task is stopped (due to a manual restart via the REST API, a rebalance, worker shutdown, etc.) while blocked on a read of source offsets during its start method, not only will it fail to gracefully stop, but the framework will not even invoke its stop method until its start method (and, as a result, the source offset read request) has completed. This prevents the task from being able to clean up any resources it has allocated and can lead to OOM errors, excessive thread creation, and other problems.
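The ordering problem can be modelled with a deliberately simplified sketch. This is an assumption-laden toy, not Connect's task lifecycle code: a single thread plays the task, a CountDownLatch (offsetsRead) stands in for the pending offset read, and the event strings are purely illustrative. The point is only that a stop request arriving while start is blocked has no effect until the read finishes.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

class BlockedStartSketch {
    static List<String> run() throws InterruptedException {
        List<String> events = new CopyOnWriteArrayList<>();
        CountDownLatch offsetsRead = new CountDownLatch(1);  // stands in for the offset read
        CountDownLatch startEntered = new CountDownLatch(1);

        Thread taskThread = new Thread(() -> {
            events.add("start entered");
            startEntered.countDown();
            try {
                offsetsRead.await(); // unbounded wait, like Future.get() with no timeout
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            events.add("start returned");
            events.add("stop invoked"); // stop only runs once start has returned
        });
        taskThread.start();

        startEntered.await();
        events.add("stop requested"); // e.g. REST restart, rebalance, worker shutdown

        taskThread.join(200); // give the task a chance to stop; it cannot
        events.add(taskThread.isAlive() ? "task still blocked" : "task finished");

        offsetsRead.countDown(); // connectivity "restored", the read completes
        taskThread.join();
        return events;
    }

    public static void main(String[] args) throws InterruptedException {
        run().forEach(System.out::println);
    }
}
```

In this model the stop request is recorded well before stop is invoked, and the task remains blocked in between, holding whatever it allocated in start.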
I've confirmed that this affects every release of Connect back through 1.0 at least; I've tagged the most recent bug fix release of every major/minor version from then on in the Affects Version/s field to avoid just putting every version in that field.