[KUDU-3048] Add time/clock synchronization metrics - ASF JIRA

For better visibility, it would be great to add metrics reflecting time/clock synchronization parameters:

- statistics on the max_error values sampled while reading the underlying clock
- statistics on the time intervals during which the underlying clock was extrapolated instead of using actual readings: the number of such intervals and statistics on their duration
- whether hybrid clock timestamps are currently generated from extrapolated clock readings instead of real ones
- if using the built-in time source:
  - the difference between the tracked true time and the local wall clock
  - the most recently computed true time
  - statistics on the maximum error of the computed true time
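To make the second item concrete, here is a minimal sketch of how the number of extrapolation intervals and the statistics on their duration could be derived from a series of clock-read results. It is an illustration only, not Kudu code: the `ExtrapolationStats` type, the `extrapolation_stats` function, and the `(time, read_ok)` sample format are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, List, Optional, Tuple


@dataclass
class ExtrapolationStats:
    """Hypothetical summary of completed extrapolation intervals."""
    count: int            # number of completed intervals
    total_seconds: float  # sum of interval durations
    max_seconds: float    # longest interval
    mean_seconds: float   # average interval duration


def extrapolation_stats(samples: Iterable[Tuple[float, bool]]) -> ExtrapolationStats:
    """Summarize extrapolation intervals.

    `samples` are (monotonic_time_seconds, read_ok) pairs in time order.
    An interval opens at the first failed read and closes at the next
    successful read. An interval still open at the end is not counted.
    """
    durations: List[float] = []
    started_at: Optional[float] = None
    for now, read_ok in samples:
        if not read_ok and started_at is None:
            started_at = now  # the clock read failed: extrapolation begins
        elif read_ok and started_at is not None:
            durations.append(now - started_at)  # a real reading is back
            started_at = None
    if not durations:
        return ExtrapolationStats(0, 0.0, 0.0, 0.0)
    return ExtrapolationStats(
        len(durations), sum(durations), max(durations), mean(durations)
    )
```

A real implementation would more likely update a counter and a histogram at the moment each interval closes; the batch form above is only meant to show what is being measured.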

As for the rationale behind the new metrics:

- max_error shows how far the clock may be from the true time; if it is consistently high, it may be time to use a different set of NTP servers or to increase the value of the --max_clock_sync_error_usec flag
- the presence of extrapolation intervals for the hybrid clock signals periods when the NTP servers were unavailable; a possible action is to revisit the set of NTP servers
- if hybrid timestamps have been extrapolated for some time, Kudu masters and tablet servers might crash once the clock error goes beyond the configured threshold; this is the time to start troubleshooting the issue to avoid possible unavailability of the cluster
- the delta between the true time tracked by the built-in NTP client and the local system clock helps to understand how log timestamps relate to HybridClock timestamps (when using the built-in NTP client, the two might diverge)
- the statistics on the true time computed by the built-in NTP client give insight into the quality of the reference NTP servers

The new metrics can be used for monitoring and alerting, allowing for proactive maintenance of a Kudu cluster.
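As an illustration of the alerting use case, here is a minimal sketch of a check that an external monitoring script could run against a snapshot of such metrics. The metric names `clock_max_error_usec` and `clock_extrapolating`, the `clock_alerts` function, and the `warn_fraction` parameter are hypothetical placeholders, not names defined by this issue; only the --max_clock_sync_error_usec flag comes from the description above.

```python
from typing import Dict, List


def clock_alerts(
    metrics: Dict[str, float],
    max_clock_sync_error_usec: int,
    warn_fraction: float = 0.5,
) -> List[str]:
    """Return alert messages for a snapshot of clock metrics.

    `metrics` maps hypothetical metric names to their current values.
    `max_clock_sync_error_usec` is the value the server was started with.
    `warn_fraction` (an assumption) sets how close to the threshold the
    error may get before a warning is raised.
    """
    alerts: List[str] = []

    max_error = metrics.get("clock_max_error_usec")
    if max_error is not None:
        if max_error >= max_clock_sync_error_usec:
            alerts.append("CRITICAL: clock error has reached the configured threshold")
        elif max_error >= warn_fraction * max_clock_sync_error_usec:
            alerts.append("WARNING: clock error is approaching the configured threshold")

    # A non-zero value means timestamps are currently being extrapolated.
    if metrics.get("clock_extrapolating", 0):
        alerts.append("WARNING: timestamps are extrapolated; check NTP server availability")

    return alerts
```

For example, with a threshold of 10,000,000 microseconds, a snapshot reporting an error of 6,000,000 microseconds while extrapolating would yield two warnings, giving an operator time to act before the servers crash.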