Kudu / KUDU-3010

unsafe_change_config can lead to a crash


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: consensus, ops-tooling
    • Labels: None

      Description

      I've seen a case where running the unsafe_change_config tool, following the steps laid out in the "Bringing a tablet that has lost a majority of replicas" section of the docs, crashed a tserver with the following error:

      I1028 08:24:31.241361 38436 raft_consensus.cc:684] T b90b0429806747a6b993d8543ab5fd50 P f344ade17ed94072b8839007ccc7570a [term 40 FOLLOWER]: Illegal state: RaftConfig change currently pending. Only one is allowed at a time.
      W1028 08:24:31.241379 38436 raft_consensus.cc:1373] T b90b0429806747a6b993d8543ab5fd50 P f344ade17ed94072b8839007ccc7570a [term 40 FOLLOWER]: Could not prepare transaction for op 34.48 and following 69 ops. Status for this op: Illegal state: RaftConfig change currently pending. Only one is allowed at a time.
      I1028 08:26:07.300520 38436 raft_consensus.cc:1058] T a6bfa86e43b74cfaa6feba4631879251 P f344ade17ed94072b8839007ccc7570a [term 17 FOLLOWER]: Refusing update from remote peer f1a7fb14b7b44a5c8b31e93114d79a8d: Log matching property violated. Preceding OpId in replica: term: 15 index: 93. Preceding OpId from leader: term: 17 index: 112. (index mismatch)
      I1028 08:26:07.301476 38436 raft_consensus.cc:2819] T a6bfa86e43b74cfaa6feba4631879251 P f344ade17ed94072b8839007ccc7570a [term 17 NON_PARTICIPANT]: Allowing unsafe config change even though there is a pending config! Existing pending config: opid_index: 95 OBSOLETE_local: false peers { permanent_uuid: "7875cc5598a44bd893998cba7bd2cc47" member_type: VOTER last_known_addr{ host: "foo01.server.net" port: 7050 }
      attrs{ promote: false }
      } peers { permanent_uuid: "f1a7fb14b7b44a5c8b31e93114d79a8d" member_type: NON_VOTER last_known_addr{ host: "foo04.server.net" port: 7050 }
      attrs{ promote: true }
      } unsafe_config_change: true; New pending config: opid_index: 96 OBSOLETE_local: false peers { permanent_uuid: "7875cc5598a44bd893998cba7bd2cc47" member_type: VOTER last_known_addr{ host: "foo01.server.net" port: 7050 }
      attrs{ promote: false }
      } peers { permanent_uuid: "f1a7fb14b7b44a5c8b31e93114d79a8d" member_type: NON_VOTER last_known_addr{ host: "foo04.server.net" port: 7050 }
      attrs{ promote: true }
      } peers { permanent_uuid: "231e6fdad22647978c9a76c07407da4c" member_type: NON_VOTER last_known_addr{ host: "foo02.server.net" port: 7050 }
      attrs{ promote: true }
      } unsafe_config_change: true
      F1028 08:26:07.302338 38436 pending_rounds.cc:179] Check failed: _s.ok() Bad status: Corruption: New operation's term is not >= than the previous op's term. Current: 14.94. Previous: 15.93
      

      It seems the tool permits a bad op to be persisted even though a config change is already in flight.
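
      To make the failed check concrete, here is a minimal sketch of the condition named in the fatal message, assuming OpIds are printed as "term.index" as in the log lines above. The names OpId, parse_op_id, and check_term_not_decreasing are hypothetical helpers for illustration only; this is not the code in pending_rounds.cc.

```python
from typing import NamedTuple


class OpId(NamedTuple):
    """A Raft operation ID, assumed to be printed in the logs as 'term.index'."""
    term: int
    index: int


def parse_op_id(text: str) -> OpId:
    """Parse the 'term.index' form seen in the log lines, e.g. '15.93'."""
    term, index = text.split(".")
    return OpId(int(term), int(index))


def check_term_not_decreasing(previous: OpId, current: OpId) -> None:
    """Raise if the new op's term is lower than the previous op's term.

    Hypothetical helper: it only mirrors the condition stated in the
    fatal message, not Kudu's actual implementation.
    """
    if current.term < previous.term:
        raise ValueError(
            "New operation's term is not >= than the previous op's term. "
            f"Current: {current.term}.{current.index}. "
            f"Previous: {previous.term}.{previous.index}"
        )


# The two OpIds from the fatal log line: the new op 14.94 carries a lower
# term than the preceding op 15.93, so the check rejects it.
previous = parse_op_id("15.93")
current = parse_op_id("14.94")
try:
    check_term_not_decreasing(previous, current)
except ValueError as err:
    print(err)
```

      With the values from the log, the new op sits at the next index (94) but in an older term (14) than the op before it (15), which is the combination the check rejects.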

        Attachments

        1. foo06.out
          22 kB
          Andrew Wong
        2. foo04.out
          15 kB
          Andrew Wong
        3. foo03.out
          54 kB
          Andrew Wong


            People

            • Assignee: Unassigned
            • Reporter: Andrew Wong (awong)
