  1. Kudu
  2. KUDU-1278

Tablets that take >5 minutes to copy will never remote bootstrap

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.7.0
    • Component/s: consensus
    • Labels:
      None
    • Target Version/s:

      Description

      Binglin Chang and I debugged this issue on his cluster. One of the servers had been shut down due to bad RAM, which triggered remote bootstrap of all of its tablets to create new replicas.

      During remote bootstrap, the leader replica keeps trying to replicate operations to the new follower while that follower is still bootstrapping. Each attempt causes the leader to try to trigger remote bootstrap again, which fails with a "Remote bootstrap already in progress" error. The leader treats this as an unsuccessful communication with the follower. After 5 minutes of receiving this error, it decides that the follower is dead, evicts it, and requests another new replica. When the evicted replica finishes copying, it finds out that it has been evicted and deletes everything it just copied. This cycle repeats forever.
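The cycle can be illustrated with a toy model. This is only a sketch, not Kudu code: `EVICTION_TIMEOUT_S`, `simulate`, and the one-failed-probe-per-second loop are hypothetical names and simplifications. Only the 5-minute timeout and the fact that every probe fails while a copy is running come from the description.

```python
EVICTION_TIMEOUT_S = 5 * 60  # from the description: evict after 5 minutes of errors


def simulate(copy_duration_s, max_attempts=3):
    """Return the attempt number on which a replica survives its copy, or None.

    Hypothetical model: while a copy is running, every leader probe fails
    with "already in progress", so the failure streak starts when the copy
    starts and only ends when the copy ends.
    """
    for attempt in range(1, max_attempts + 1):
        seconds_failing = 0
        evicted = False
        for _ in range(copy_duration_s):
            seconds_failing += 1  # one failed probe per simulated second
            if seconds_failing > EVICTION_TIMEOUT_S:
                evicted = True  # leader gives up and requests a new replica
                break
        if not evicted:
            return attempt  # copy finished inside the timeout window
        # The evicted replica later deletes what it copied, and the
        # replacement replica starts from scratch on the next attempt.
    return None
```

In this model `simulate(120)` succeeds on the first attempt, while `simulate(600)` never succeeds no matter how many attempts are allowed, because every replacement replica hits the same timeout.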

      We need to fix the leader so that, as long as the remote bootstrapping replica is making progress, we don't consider it dead.
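One possible shape for such a check is sketched below. It is an illustration under stated assumptions, not the fix that was committed: the class `FollowerLiveness`, its methods, and the idea that the "already in progress" response carries a `bytes_copied` counter are all hypothetical.

```python
class FollowerLiveness:
    """Hypothetical leader-side liveness tracker for a single follower."""

    def __init__(self, timeout_s, now_s=0.0):
        self.timeout_s = timeout_s
        self.last_ok_s = now_s  # last time the follower looked alive
        self.bytes_seen = 0     # highest copy progress reported so far

    def record_success(self, now_s):
        # A normal successful exchange always refreshes the timer.
        self.last_ok_s = now_s

    def record_bootstrap_in_progress(self, now_s, bytes_copied):
        # Assumption: the error response reports how much has been copied.
        if bytes_copied > self.bytes_seen:
            self.bytes_seen = bytes_copied
            self.last_ok_s = now_s  # forward progress counts as a sign of life
        # A stalled copy does not refresh the timer, so it can still time out.

    def should_evict(self, now_s):
        return now_s - self.last_ok_s > self.timeout_s
```

With this rule a copy that runs for 10 minutes but keeps advancing is never evicted, while a copy that stops advancing is still evicted once the timeout elapses.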

            People

            • Assignee:
              decster Binglin Chang
              Reporter:
              tlipcon Todd Lipcon
