Hide
We noticed that the time it takes to recover after agent failover is non constant (specifically related to number of executors). To allow people have some idea on this, we should create metrics.
What I propose specifically:
- a gauge about how many executor directories need to be scanned;
- a gauge about how long agent takes to scan them.
Alternatively, this may be done through a log line, but it may be harder to aggregate or monitor.
Show
We noticed that the time it takes to recover after agent failover is non constant (specifically related to number of executors). To allow people have some idea on this, we should create metrics.
What I propose specifically:
- a gauge about how many executor directories need to be scanned;
- a gauge about how long agent takes to scan them.
Alternatively, this may be done through a log line, but it may be harder to aggregate or monitor.