Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
When running a checksum scan, we've seen cases where the following is reported:
Error: Remote error: Service unavailable: Timed out: could not wait for desired snapshot timestamp to be consistent: Timed out waiting for ts: P: 1621906798986764 usec, L: 0 to be safe (mode: NON-LEADER). Current safe time: P: 1621906798962044 usec, L: 0 Physical time difference: 0.025s
and this results in messages like:
Aborted: checksum scan error: 1 errors were detected
Without much context about Kudu, this makes it seem like there is corruption between replicas, even though the only problem is that the replica's safe time is lagging slightly behind the snapshot timestamp (0.025s in the example above). We should consider either:
- allowing the wait time to be configured when running the tool, or
- rewording the result so that it's clear the scan failed and no checksums were verified for the tablet
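To illustrate the second option, here is a rough sketch of a summary that keeps failed scans separate from real checksum mismatches. It is written in Python for brevity and is not Kudu code: the names ReplicaOutcome, ReplicaResult, safe_time_lag_usec and summarize, the tablet ID, the replica address, and the output wording are all hypothetical. The only thing taken from this report is the format of the error text above.

```python
import re
from dataclasses import dataclass
from enum import Enum


class ReplicaOutcome(Enum):
    """Hypothetical per-replica outcomes; not Kudu's real types."""
    VERIFIED = "verified"
    MISMATCH = "mismatch"
    SCAN_FAILED = "scan failed"


@dataclass
class ReplicaResult:
    tablet_id: str
    replica: str
    outcome: ReplicaOutcome
    detail: str = ""  # raw error text, if the scan failed


# Pulls the requested snapshot timestamp and the replica's current safe
# time out of an error message shaped like the one quoted in this report.
_SAFE_TIME_RE = re.compile(
    r"waiting for ts: P: (\d+) usec.*Current safe time: P: (\d+) usec"
)


def safe_time_lag_usec(error_text):
    """Return how far safe time trails the snapshot in usec, or None."""
    match = _SAFE_TIME_RE.search(error_text)
    if match is None:
        return None
    return int(match.group(1)) - int(match.group(2))


def summarize(results):
    """Build a summary that never reports a failed scan as corruption."""
    mismatches = [r for r in results if r.outcome is ReplicaOutcome.MISMATCH]
    failures = [r for r in results if r.outcome is ReplicaOutcome.SCAN_FAILED]
    lines = []
    if mismatches:
        lines.append(
            f"Corruption: checksum mismatch on {len(mismatches)} replica(s)"
        )
    for r in failures:
        line = (
            f"Incomplete: scan of tablet {r.tablet_id} on {r.replica} failed; "
            "no checksums were verified"
        )
        lag = safe_time_lag_usec(r.detail)
        if lag is not None:
            line += f" (replica safe time lagging by {lag} usec)"
        lines.append(line)
    if not lines:
        lines.append("OK: all replica checksums match")
    return "\n".join(lines)
```

With the error above as the detail of a single failed replica, this would report an incomplete scan with a lag of 24720 usec and would not mention corruption at all.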