Currently, the scheduler_heartbeat metric exposed through the statsd integration is a gauge. I'm proposing to change the gauge to a counter for better integration with Prometheus via the [statsd_exporter|https://github.com/prometheus/statsd_exporter].
Rather than pointing Airflow at an actual statsd server, you can point it at this exporter, which accumulates the metrics and exposes them at /metrics for Prometheus to scrape. The problem is that once the scheduler sets this value during its first loop, the exporter will always expose it to Prometheus as 1. The scheduler can crash or be turned off, and the statsd exporter will keep reporting 1 until the exporter itself is restarted and rebuilds its internal state.
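To make the stale-gauge behaviour concrete, here is a toy Python model of an exporter that keeps the last state it received. ToyExporter and run are hypothetical names used for illustration only; this is not how the statsd_exporter is actually implemented, and it only understands statsd-style lines of the form name:value|g and name:value|c.

```python
class ToyExporter:
    """Hypothetical stand-in for an exporter that keeps metric state in memory."""

    def __init__(self):
        self.metrics = {}

    def receive(self, line):
        # Parse a statsd-style line such as "scheduler_heartbeat:1|g".
        name, rest = line.split(":", 1)
        value, kind = rest.split("|", 1)
        if kind == "g":
            # A gauge overwrites the stored value.
            self.metrics[name] = float(value)
        elif kind == "c":
            # A counter adds to the stored value.
            self.metrics[name] = self.metrics.get(name, 0.0) + float(value)
        else:
            raise ValueError(f"unsupported metric type: {kind}")

    def scrape(self, name):
        # Return whatever is stored, even if nothing has arrived recently.
        return self.metrics.get(name)


def run(kind, beats_before_crash, scrapes_after_crash):
    """Send heartbeats, then scrape repeatedly after the scheduler has stopped."""
    exporter = ToyExporter()
    for _ in range(beats_before_crash):
        exporter.receive(f"scheduler_heartbeat:1|{kind}")
    return [exporter.scrape("scheduler_heartbeat") for _ in range(scrapes_after_crash)]


gauge_scrapes = run("g", 3, 2)    # scheduler sent 3 gauge heartbeats, then died
counter_scrapes = run("c", 3, 2)  # same, but as counter increments
```

In this model the gauge reads 1 on every scrape whether or not the scheduler is still alive, while the counter stops increasing once the heartbeats stop, which is the signal a rate-based alert can pick up.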
By turning this metric into a counter, we can detect a problem with the scheduler by graphing and alerting on its rate. If the counter's rate of change drops below the expected rate (determined by the scheduler_heartbeat_sec setting), we can fire an alert.
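A minimal sketch of that check in Python, under two assumptions: the scheduler increments the counter once per heartbeat interval, and the samples are (timestamp in seconds, counter value) pairs ordered oldest first. The function names and the tolerance parameter are hypothetical. Unlike Prometheus's own rate(), this sketch does not handle counter resets.

```python
def counter_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must increase")
    return (v1 - v0) / (t1 - t0)


def heartbeat_is_healthy(samples, heartbeat_sec, tolerance=0.5):
    """True if the observed rate is at least `tolerance` times the expected rate.

    Assumes one counter increment per heartbeat, so the expected rate is
    1 / heartbeat_sec increments per second.
    """
    expected = 1.0 / heartbeat_sec
    return counter_rate(samples) >= expected * tolerance


# Scheduler beating every 5 seconds: 12 increments over 60 seconds.
healthy = [(0, 0), (30, 6), (60, 12)]
# Scheduler stopped: the counter stays flat.
stalled = [(0, 12), (30, 12), (60, 12)]

print(heartbeat_is_healthy(healthy, heartbeat_sec=5))
print(heartbeat_is_healthy(stalled, heartbeat_sec=5))
```

The tolerance leaves some slack so that a slightly slow scheduler loop does not trigger an alert; the right value depends on the deployment.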
This should help adoption in Kubernetes environments, where Prometheus is pretty much the standard for monitoring.