Kudu / KUDU-3310

Checksum scan results for lagging replicas can be confusing

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ops-tooling
    • Labels: None

    Description

      When running a checksum scan, we've seen cases where the following is reported:

      Error: Remote error: Service unavailable: Timed out: could not wait for desired snapshot timestamp to be consistent: Timed out waiting for ts: P: 1621906798986764 usec, L: 0 to be safe (mode: NON-LEADER). Current safe time: P: 1621906798962044 usec, L: 0 Physical time difference: 0.025s
      

      and this results in messages like:

      Aborted: checksum scan error: 1 errors were detected
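How far behind the replica was can be read off the timeout message itself. The following is a minimal sketch, not part of any Kudu tool: the helper name `safe_time_lag_usec` is hypothetical, and it assumes the message carries the desired timestamp first and the current safe time second, each as `P: <microseconds> usec`.

```python
import re

# Matches the physical component of a timestamp, e.g. "P: 1621906798986764 usec".
_PHYSICAL_USEC = re.compile(r"P:\s*(\d+)\s*usec")


def safe_time_lag_usec(message: str) -> int:
    """Return (desired snapshot timestamp - current safe time) in microseconds.

    Hypothetical helper: assumes the first "P: ... usec" value is the desired
    timestamp and the second is the replica's current safe time.
    """
    values = [int(v) for v in _PHYSICAL_USEC.findall(message)]
    if len(values) < 2:
        raise ValueError("expected a desired timestamp and a safe time")
    desired, safe = values[0], values[1]
    return desired - safe


message = (
    "Timed out waiting for ts: P: 1621906798986764 usec, L: 0 to be safe "
    "(mode: NON-LEADER). Current safe time: P: 1621906798962044 usec, L: 0 "
    "Physical time difference: 0.025s"
)

lag = safe_time_lag_usec(message)
print(f"{lag} usec ({lag / 1e6:.3f}s)")  # prints "24720 usec (0.025s)"
```

The result, roughly 25 ms, matches the "Physical time difference" reported in the message.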
      

      To someone without much Kudu context, this looks like corruption between replicas, even though the only problem is that a replica is lagging slightly and the scan timed out waiting for it. We should consider either:

      • allowing the wait time to be configured when running the tool, or
      • rewording the result so that it is clear the scan failed and no checksums were verified for the tablet
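As an illustration of the second option, the sketch below separates "the scan could not complete" from "the replicas disagree" when summarizing a tablet. It is an assumption-laden example rather than the tool's actual logic: `ReplicaResult`, `summarize_tablet`, and the status words are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReplicaResult:
    """Outcome of checksumming one replica (hypothetical structure)."""
    replica: str
    checksum: Optional[int] = None  # set when the scan completed
    error: Optional[str] = None     # set when the scan failed, e.g. on timeout


def summarize_tablet(tablet_id: str, results: List[ReplicaResult]) -> str:
    """Build a one-line summary that never reports a failed scan as a mismatch."""
    failed = [r for r in results if r.error is not None]
    checksums = {r.checksum for r in results if r.error is None}

    # Only replicas that actually returned a checksum can disagree.
    if len(checksums) > 1:
        return f"tablet {tablet_id}: MISMATCH: replicas returned different checksums"
    if failed:
        names = ", ".join(r.replica for r in failed)
        return (
            f"tablet {tablet_id}: INCOMPLETE: scan failed on {names}; "
            "checksums were not verified"
        )
    return f"tablet {tablet_id}: OK: all replica checksums match"


# A lagging replica yields INCOMPLETE rather than something that reads as corruption.
print(summarize_tablet("t1", [
    ReplicaResult("ts-a", checksum=42),
    ReplicaResult("ts-b", error="Timed out waiting for ts to be safe"),
]))
```

Keeping the two cases apart in the summary means a timeout on a lagging replica is reported as an unverified tablet, not as an error that can be mistaken for divergent data.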

          People

            Assignee: Unassigned
            Reporter: Andrew Wong (awong)
            Votes: 0
            Watchers: 2
