Uploaded image for project: 'Kudu'
  1. Kudu
  2. KUDU-2924

Let newly rereplicated replicas try to catch up before evicting them



    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.11.0
    • None
    • consensus, tablet copy
    • None


      In heavily loaded clusters with a high rate of ingest, laggy FOLLOWER eviction can lead to unsatisfiable tablet copy loops. This plays out something like this:

      1. Replication group containing replicas A, B, C. A is the leader.
      2. Due to load, C starts to lag behind A.
      3. Eventually, C is evicted.
      4. A new replica D is added elsewhere and tablet copy begins from A. It's going to copy WAL ops M..N, where M is the oldest op not yet flushed, and N is the most recent op written.
      5. Due to a separate bug (detailed below), A actually thinks D needs ops L..N where L is close to but a bit before M.
      6. More and more data is written to A and replicated to B. The op index eventually climbs up to O, where segment(O) - segment(M) exceeds the maximum number of segments to retain.
      7. A GCs all ops up to M, including L. D can no longer catch up and is evicted, even before the tablet copy is finished.
      8. A new replica E is added and tablet copy begins from A. The cycle repeats.

      Even if that separate bug is fixed, A will release its anchor on ops M..N when D finishes copying, which means D will still be evicted before it has a chance to catch up.

      Why does this matter? Isn't it "correct" that D can't catch up and thus should be evicted? Well, yes, but we've just spent a bunch of cluster resources on a tablet copy that amounted to nothing useful. We should try to get our money's worth first by giving D one "free" catch-up: don't evict D unless it falls behind after catching up to O, or if some timer expires.

      The aforementioned separate bug: the addition of D and its tablet copy are two separate events. When D is added, we use a conservative estimate to figure out what op it should have:

        // We don't know the last operation received by the peer so, following the
        // Raft protocol, we set next_index to one past the end of our own log. This
        // way, if calling this method is the result of a successful leader election
        // and the logs between the new leader and remote peer match, the
        // peer->next_index will point to the index of the soon-to-be-written NO_OP
        // entry that is used to assert leadership. If we guessed wrong, and the peer
        // does not have a log that matches ours, the normal queue negotiation
        // process will eventually find the right point to resume from.
        tracked_peer->next_index = queue_state_.last_appended.index() + 1;

      When the tablet copy begins, A anchors to the last op in its WAL. If the tablet copy starts after the addition of D, tracked_peer->next_index will be too conservative, and even though all the necessary ops will be copied to D, A may evict D if tracked_peer->next_index is GC'ed.




            Unassigned Unassigned
            adar Adar Dembo
            0 Vote for this issue
            1 Start watching this issue