Status: Open
Resolution: Unresolved
Currently, there is no metric in Kafka Connect to track when a source connector fails to poll data from the source. This information would be useful to operators and developers to visualize, monitor and alert when the connector fails to poll records from the source.
Existing metrics such as kafka_producer_producer_metrics_record_error_total and kafka_connect_task_error_metrics_total_record_failures cover only failures that occur while producing data to the Kafka cluster. They do not cover cases where the source task itself fails with a RetriableException or ConnectException.
Polling from the source can fail because the source system is unavailable or because of errors in the Connect configuration. Currently, such failures cannot be monitored directly through metrics. Operators have to dig through logs instead, which is inconsistent with how other failures are monitored.
I propose adding new metrics to Kafka Connect, "source-record-poll-error-total" and "source-record-poll-error-rate" that can be used to monitor failures during polling.
source-record-poll-error-total - The total number of times a source connector failed to poll data from the source. This will include both retryable and non-retryable exceptions.
source-record-poll-error-rate - The rate of the above failures per unit of time.
These metrics would be tracked at the connector level and could be exposed through JMX along with the other metrics.
I am willing to submit a PR if this looks good. Sample implementation code is below:
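To make the intended semantics of the two metrics concrete, here is a minimal, self-contained sketch. The class PollErrorStats and its methods are hypothetical names used only for illustration. It does not use Kafka's metrics classes, and it computes the rate over the whole elapsed time rather than over a sampled window, which is a simplifying assumption.

```java
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical stand-in for the two proposed metrics.
 * Not Kafka's metrics API; names and rate definition are assumptions.
 */
final class PollErrorStats {
    private final LongAdder total = new LongAdder();
    private final long startNanos;

    PollErrorStats(long startNanos) {
        this.startNanos = startNanos;
    }

    /** Called once for every failed poll, retriable or not. */
    void recordPollError() {
        total.increment();
    }

    /** Corresponds to source-record-poll-error-total. */
    long total() {
        return total.sum();
    }

    /**
     * Corresponds to source-record-poll-error-rate, expressed here as
     * errors per second since startNanos (a simplification).
     */
    double ratePerSecond(long nowNanos) {
        long elapsedNanos = nowNanos - startNanos;
        if (elapsedNanos <= 0) {
            return 0.0;
        }
        return total.sum() * 1_000_000_000.0 / elapsedNanos;
    }
}
```

With this definition, three recorded failures over two seconds give a total of 3 and a rate of 1.5 errors per second.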
    protected List<SourceRecord> poll() throws InterruptedException {
        try {
            return task.poll();
        } catch (RetriableException | org.apache.kafka.common.errors.RetriableException e) {
            log.warn("{} failed to poll records from SourceTask. Will retry operation.", this, e);
            sourceTaskMetricsGroup.recordPollError();
            // Do nothing. Let the framework poll whenever it's ready.
            return null;
        } catch (Throwable e) {
            sourceTaskMetricsGroup.recordPollError();
            throw e;
        }
    }
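The behaviour of the sample above can be exercised without a Connect runtime. The following is a rough sketch under stated assumptions: PollErrorSketch, its nested RetriableException and the Source interface are hypothetical stand-ins for the worker task, Connect's RetriableException and SourceTask.poll(), and a plain counter takes the place of sourceTaskMetricsGroup.recordPollError().

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the worker-side poll wrapper; not Kafka code. */
final class PollErrorSketch {

    /** Stand-in for Connect's RetriableException so the sketch compiles on its own. */
    static final class RetriableException extends RuntimeException {
        RetriableException(String message) {
            super(message);
        }
    }

    /** Stand-in for SourceTask.poll(). */
    interface Source {
        List<String> poll() throws InterruptedException;
    }

    private final Source source;
    // Plays the role of the proposed source-record-poll-error-total metric.
    private final AtomicLong pollErrors = new AtomicLong();

    PollErrorSketch(Source source) {
        this.source = source;
    }

    List<String> poll() throws InterruptedException {
        try {
            return source.poll();
        } catch (RetriableException e) {
            // Retriable failure: count it and let the caller poll again later.
            pollErrors.incrementAndGet();
            return null;
        } catch (Throwable e) {
            // Any other failure: count it, then propagate unchanged.
            pollErrors.incrementAndGet();
            throw e;
        }
    }

    long pollErrorTotal() {
        return pollErrors.get();
    }
}
```

One point this makes visible: because the second handler catches Throwable, an InterruptedException from the source is also counted. Whether an interruption should count as a poll error is a design choice that the proposal leaves open.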