Kudu / KUDU-2354

In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly retries operations to add a replacement replica even if replacement is no longer needed

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.0
    • Fix Version/s: None
    • Component/s: master
    • Labels: None
    • Environment: 3 tservers in the cluster, single master (?)

    Description

      In a scenario reported by adar, 100 iterations of the following command were run:

      kudu perf loadgen --keep-auto-table --table-num-buckets=40 --num-rows-per-thread=1 --table-num-replicas=3
      

      The run took about 10-15 minutes to complete, and for reasons not yet understood ksck kept reporting UNAVAILABLE tablets for another 5-10 minutes afterwards. Most likely, because of the spike in IO activity, tablet leaders didn't receive heartbeats from some replicas and tried to replace them. Eventually the cluster stabilized (ksck reported no problems), but the following messages continued to appear in the master's log:

      I0315 13:52:00.871310 106157 catalog_manager.cc:3234] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 2776eb10c241426e90ddf7354260ee04 (attempt 22)
      I0315 13:52:00.871354 106157 catalog_manager.cc:2700] Scheduling retry of ChangeConfig:ADD_PEER:NON_VOTER RPC for tablet 2776eb10c241426e90ddf7354260ee04 with cas_config_opid_index -1 with a delay of 60018 ms (attempt = 22)
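
To see which tablets are stuck in this retry loop, the "Scheduling retry" messages can be pulled out of the master's log. The following is a minimal sketch that assumes the message format is exactly as in the two lines quoted above; the helper name `parse_retry_line` is hypothetical and not part of Kudu.

```python
import re

# Pattern derived only from the log line quoted in this report; other Kudu
# versions may format the message differently (assumption).
RETRY_RE = re.compile(
    r"Scheduling retry of (?P<op>\S+) RPC for tablet (?P<tablet>[0-9a-f]+) "
    r"with cas_config_opid_index (?P<cas>-?\d+) "
    r"with a delay of (?P<delay_ms>\d+) ms \(attempt = (?P<attempt>\d+)\)"
)


def parse_retry_line(line):
    """Return the fields of a 'Scheduling retry' message, or None if no match."""
    m = RETRY_RE.search(line)
    if m is None:
        return None
    return {
        "op": m.group("op"),
        "tablet": m.group("tablet"),
        "cas_config_opid_index": int(m.group("cas")),
        "delay_ms": int(m.group("delay_ms")),
        "attempt": int(m.group("attempt")),
    }
```

Feeding every line of the master's log through `parse_retry_line` and keeping the highest `attempt` per tablet gives a quick list of the operations that are still being retried.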
      

      Of course, with only 3 tservers in the cluster, no attempt to add a replacement non-voter replica can succeed, since there is no spare tablet server to host it. Still, it would make sense to stop retrying such an operation once the tablet's OpId index is far ahead of the cas_config_opid_index of the operation being retried.
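
The proposed stop condition can be sketched as follows. This illustrates the idea only and is not the catalog manager's actual code: the class `RetriedChangeConfig`, the function `should_keep_retrying`, and the threshold `MAX_OPID_INDEX_LAG` are all hypothetical, and "far ahead" is modeled as a plain numeric gap. The sketch compares the indexes as-is, so an operation logged with cas_config_opid_index -1 is dropped as soon as the tablet's index passes the threshold; whether -1 deserves special treatment is left open.

```python
from dataclasses import dataclass

# Hypothetical threshold: how far the tablet's OpId index may run ahead of
# the retried operation's cas_config_opid_index before the retry is dropped.
MAX_OPID_INDEX_LAG = 1000


@dataclass
class RetriedChangeConfig:
    """Hypothetical stand-in for a ChangeConfig operation awaiting retry."""
    tablet_id: str
    cas_config_opid_index: int
    attempt: int


def should_keep_retrying(op, tablet_opid_index, max_lag=MAX_OPID_INDEX_LAG):
    """Return False once the tablet has moved far past the operation's index."""
    return tablet_opid_index - op.cas_config_opid_index <= max_lag
```

A retry scheduler built along these lines would call `should_keep_retrying` before each new attempt and abandon the operation when it returns False, instead of rescheduling it indefinitely.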

          People

            Assignee: Unassigned
            Reporter: Alexey Serbin (aserbin)