Something seems to be wrong with how ksck handles checksum timestamps. I have a recently restarted cluster, and I ran ksck against it. One of the tablets has a replica which was "lost", i.e. it fell too far behind and could never be caught up. ksck just reports it as a bad checksum. Shouldn't it instead wait until the provided timestamp is "safe" on that replica, and if the wait times out, report an error saying the replica is too far behind?
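To make the proposal concrete, here is a rough sketch of the behavior I have in mind. It is a standalone Python model, not ksck code: `check_replica`, `ChecksumStatus`, and the `get_safe_time` / `compute_checksum` callbacks are all hypothetical names standing in for whatever the tool actually does when it talks to a tablet server.

```python
import time
from enum import Enum
from typing import Callable


class ChecksumStatus(Enum):
    OK = "checksum matches"
    MISMATCH = "checksum mismatch"
    TOO_FAR_BEHIND = "replica did not reach the snapshot timestamp before the timeout"


def check_replica(
    get_safe_time: Callable[[], int],       # hypothetical: replica's current safe time
    compute_checksum: Callable[[int], int], # hypothetical: checksum at a snapshot timestamp
    snapshot_ts: int,
    expected_checksum: int,
    timeout_s: float = 1.0,
    poll_interval_s: float = 0.01,
) -> ChecksumStatus:
    """Wait for the replica to consider snapshot_ts safe, then compare checksums."""
    deadline = time.monotonic() + timeout_s
    # Poll until the replica's safe time has caught up to the snapshot timestamp.
    while get_safe_time() < snapshot_ts:
        if time.monotonic() >= deadline:
            # The replica never caught up: report that, not a bad checksum.
            return ChecksumStatus.TOO_FAR_BEHIND
        time.sleep(poll_interval_s)
    # Only a replica that reached the timestamp gets a real checksum comparison.
    if compute_checksum(snapshot_ts) == expected_checksum:
        return ChecksumStatus.OK
    return ChecksumStatus.MISMATCH
```

The point of the sketch is only that "behind" and "divergent" become two separate outcomes, so a lagging replica is never reported as a checksum mismatch.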
As a stopgap, maybe we could have ksck also include the latest opid in the error printout, to make it more obvious that a server is just "behind" and not divergent?
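For the stopgap, something along these lines is what I'm picturing. Again this is only an illustrative Python sketch: the `OpId` class, `format_checksum_error`, and the message wording are made up for the example and are not the current ksck output. The "may be behind" hint is a simple heuristic that compares (term, index) pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class OpId:
    """Hypothetical stand-in for an operation id, ordered by (term, index)."""
    term: int
    index: int

    def __str__(self) -> str:
        return f"{self.term}.{self.index}"


def format_checksum_error(replica: str, latest: OpId, newest_seen: OpId) -> str:
    """Build an error line that includes the replica's latest opid.

    newest_seen is the highest opid observed across the tablet's replicas.
    """
    msg = f"checksum mismatch on replica {replica} (latest opid {latest})"
    if latest < newest_seen:
        # Heuristic only: a lower opid suggests lag rather than divergence.
        msg += f"; replica may be behind, newest opid seen is {newest_seen}"
    return msg
```

With the opid in the message, someone reading the output can at least tell at a glance whether the "bad" replica is simply lagging the others.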