We encountered an interesting problem with our connect cluster. At times, seemingly at random, some connect sink task metrics would disappear from Datadog (which is where we send these metrics). After some investigation, I noticed that the metrics in question weren't being reported by the connect servers themselves.
After some more investigation, I noticed that the metrics stopped reporting after a rebalance was triggered, and our logs were filled with "Graceful stop of task ... failed". Digging into what happens in the code when this error appears, it turns out it means that stopping the task timed out for whatever reason, and the connect cluster will no longer wait for it to stop. The task will still stop eventually, but in the meantime a new task can be spun up. (https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L587, which calls https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L120)
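To make the pattern concrete, here is a minimal self-contained sketch (not the actual Kafka code; class and method names are my own) of what the linked Worker/WorkerTask code does: signal stop, await a latch for a bounded time, and on timeout give up waiting while the task thread keeps running in the background.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class GracefulStopSketch {
    /**
     * Simulates stopping a task whose shutdown takes taskStopMs while the
     * worker is only willing to wait timeoutMs. Returns true if the stop was
     * graceful (finished within the timeout), false if the worker gave up
     * waiting -- the task thread still finishes on its own afterwards.
     */
    public static boolean simulateStop(long taskStopMs, long timeoutMs) {
        CountDownLatch shutdownLatch = new CountDownLatch(1);
        Thread task = new Thread(() -> {
            try {
                Thread.sleep(taskStopMs); // e.g. a slow flush of buffered records
            } catch (InterruptedException ignored) { }
            shutdownLatch.countDown();    // the old task does eventually stop
        });
        task.start();
        try {
            boolean graceful = shutdownLatch.await(timeoutMs, TimeUnit.MILLISECONDS);
            if (!graceful) {
                // This is the moment Connect logs "Graceful stop of task ... failed"
                // and the rebalance proceeds; a replacement task can now start
                // while this one is still alive.
            }
            task.join(); // only here so the demo exits cleanly
            return graceful;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(simulateStop(300, 50));  // times out: "graceful stop failed"
        System.out.println(simulateStop(10, 2000)); // stops within the timeout
    }
}
```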
So a new task is spun up and begins consuming records and doing work. Then, at some point, the old task is finally removed, and the very last thing that happens during its removal is that the metric group associated with that task is removed. (https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L232 which, in this case, calls https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L179)
The issue with this is that task-based metrics are registered under a set of tags that one would expect not to change at runtime. Meaning that, when the old task IS EVENTUALLY REMOVED, it removes the metric group that the new task is using (if the new task came up on the same connect node that the old task was running on). (https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L721)
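A toy model of the race (my own simplified code, not Kafka's; the real metric groups are keyed by tags such as connector name and task id, which are identical for the old task and its replacement):

```java
import java.util.HashMap;
import java.util.Map;

public class MetricGroupRace {
    // Stand-in for the metrics registry: metric groups keyed only by tags.
    static final Map<String, Double> registry = new HashMap<>();

    static String groupKey(String connector, int task) {
        return connector + "-" + task; // tags don't change across restarts
    }

    static void registerOrReuse(String connector, int task) {
        registry.putIfAbsent(groupKey(connector, task), 0.0);
    }

    static void record(String connector, int task, double value) {
        registry.computeIfPresent(groupKey(connector, task), (k, v) -> v + value);
    }

    static void close(String connector, int task) {
        // The very last step of removing a task: drop its metric group by key.
        registry.remove(groupKey(connector, task));
    }

    public static void main(String[] args) {
        registerOrReuse("my-sink", 0); // old task running
        // Rebalance: graceful stop times out, replacement starts on the same node.
        registerOrReuse("my-sink", 0); // new task registers under the SAME key
        record("my-sink", 0, 42.0);    // new task reports metrics
        close("my-sink", 0);           // old task finally stops...
        // ...and takes the new task's metric group with it:
        System.out.println(registry.containsKey(groupKey("my-sink", 0))); // false
    }
}
```

Because both tasks resolve to the same key, the late cleanup of the old task is indistinguishable from cleaning up the new one.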
I tried increasing the "task.shutdown.graceful.timeout.ms" config to three times its previous value, but that did not completely remove the problem. And even if it had, it wouldn't change the fact that a minor network blip in my connect cluster could result in us needing to redeploy simply because metrics went missing after task shutdowns took longer than intended.
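For reference, this is a worker-level setting in the connect worker properties file; the value below is illustrative only (the stock default is 5000 ms, and raising it only lengthens the race window rather than closing it):

```properties
# connect-distributed.properties (worker config)
# How long to wait for a task to stop gracefully during shutdown/rebalance.
task.shutdown.graceful.timeout.ms=15000
```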