Kudu / KUDU-3010

unsafe_change_config can lead to a crash


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: consensus, ops-tooling
    • Labels: None

    Description

      I've seen a case where running the unsafe_change_config tool, following the "Bringing a tablet that has lost a majority of replicas" steps, crashed a tserver with the following log output:

      I1028 08:24:31.241361 38436 raft_consensus.cc:684] T b90b0429806747a6b993d8543ab5fd50 P f344ade17ed94072b8839007ccc7570a [term 40 FOLLOWER]: Illegal state: RaftConfig change currently pending. Only one is allowed at a time.
      W1028 08:24:31.241379 38436 raft_consensus.cc:1373] T b90b0429806747a6b993d8543ab5fd50 P f344ade17ed94072b8839007ccc7570a [term 40 FOLLOWER]: Could not prepare transaction for op 34.48 and following 69 ops. Status for this op: Illegal state: RaftConfig change currently pending. Only one is allowed at a time.
      I1028 08:26:07.300520 38436 raft_consensus.cc:1058] T a6bfa86e43b74cfaa6feba4631879251 P f344ade17ed94072b8839007ccc7570a [term 17 FOLLOWER]: Refusing update from remote peer f1a7fb14b7b44a5c8b31e93114d79a8d: Log matching property violated. Preceding OpId in replica: term: 15 index: 93. Preceding OpId from leader: term: 17 index: 112. (index mismatch)
      I1028 08:26:07.301476 38436 raft_consensus.cc:2819] T a6bfa86e43b74cfaa6feba4631879251 P f344ade17ed94072b8839007ccc7570a [term 17 NON_PARTICIPANT]: Allowing unsafe config change even though there is a pending config! Existing pending config: opid_index: 95 OBSOLETE_local: false peers { permanent_uuid: "7875cc5598a44bd893998cba7bd2cc47" member_type: VOTER last_known_addr{ host: "foo01.server.net" port: 7050 }
      attrs{ promote: false }
      } peers { permanent_uuid: "f1a7fb14b7b44a5c8b31e93114d79a8d" member_type: NON_VOTER last_known_addr{ host: "foo04.server.net" port: 7050 }
      attrs{ promote: true }
      } unsafe_config_change: true; New pending config: opid_index: 96 OBSOLETE_local: false peers { permanent_uuid: "7875cc5598a44bd893998cba7bd2cc47" member_type: VOTER last_known_addr{ host: "foo01.server.net" port: 7050 }
      attrs{ promote: false }
      } peers { permanent_uuid: "f1a7fb14b7b44a5c8b31e93114d79a8d" member_type: NON_VOTER last_known_addr{ host: "foo04.server.net" port: 7050 }
      attrs{ promote: true }
      } peers { permanent_uuid: "231e6fdad22647978c9a76c07407da4c" member_type: NON_VOTER last_known_addr{ host: "foo02.server.net" port: 7050 }
      attrs{ promote: true }
      } unsafe_config_change: true
      F1028 08:26:07.302338 38436 pending_rounds.cc:179] Check failed: _s.ok() Bad status: Corruption: New operation's term is not >= than the previous op's term. Current: 14.94. Previous: 15.93
      

      It seems the tool permits a bad op to be persisted even though a config change is already in flight.
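
      For illustration only, the check that fails in the fatal log line can be sketched as below. This is a toy Python model, not Kudu's C++ code in pending_rounds.cc: the names OpId, PendingRounds, and try_add are hypothetical, and the only thing assumed is the invariant stated in the error message itself (a new op's term must be >= the previous op's term). The OpIds are the ones from the log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpId:
    """Hypothetical stand-in for a Raft OpId, printed as term.index."""
    term: int
    index: int

    def __str__(self) -> str:
        return f"{self.term}.{self.index}"


class PendingRounds:
    """Toy model of a replica's pending ops; tracks only the last OpId."""

    def __init__(self, last_op: OpId) -> None:
        self.last_op = last_op

    def add(self, op: OpId) -> None:
        # Invariant from the error message: terms must not decrease
        # from one op to the next.
        if op.term < self.last_op.term:
            raise ValueError(
                "New operation's term is not >= than the previous op's term. "
                f"Current: {op}. Previous: {self.last_op}"
            )
        self.last_op = op


def try_add(rounds: PendingRounds, op: OpId) -> str:
    """Return "ok" if the op is accepted, else the violation message."""
    try:
        rounds.add(op)
        return "ok"
    except ValueError as e:
        return str(e)


# OpIds taken from the fatal log line: previous op 15.93, new op 14.94.
rounds = PendingRounds(OpId(term=15, index=93))
result = try_add(rounds, OpId(term=14, index=94))
print(result)
```

      In this model the violation is reported as an error the caller can handle. In the log above the same condition trips a fatal check, so the tserver crashes instead of rejecting the op.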

      Attachments

        1. foo03.out
          54 kB
          Andrew Wong
        2. foo04.out
          15 kB
          Andrew Wong
        3. foo06.out
          22 kB
          Andrew Wong

        People

            Assignee: Unassigned
            Reporter: Andrew Wong (awong)
            Votes: 0
            Watchers: 3
