Suppose you have three masters. One suffers a hardware failure. Normally you'd follow the steps outlined here to safely replace the dead master. But suppose you forget, and instead, you bring in a new machine with the same DNS name and IP address and start a fresh master with the same configuration (--fs_wal_dir, --fs_data_dirs, and --master_addresses) as the dead master.
Now you're in a bind: the fresh master generated a new UUID on startup, but the two remaining masters still expect the dead master's UUID at that address. The new master is unable to communicate with them and never joins their consensus group, so the multi-master deployment remains degraded.
The workflow to fix this is a variant of the recovery workflow:
- Stop the new master.
- Delete all data from the new master's WAL and data directories.
- Run sudo -u kudu kudu fs format with --fs_wal_dir and --fs_data_dirs pointing at the new master's WAL and data directories.
- Run sudo -u kudu kudu pbc edit on each FS instance file (one in the WAL directory and one in each data directory) on the new master, replacing the UUID generated during the format operation with the old master's UUID, which is what the two remaining masters expect. Note: kudu pbc edit expects the UUID as a base64-encoded string, so base64-encode the old UUID before splicing it in.
- Run sudo -u kudu kudu remote_replica copy to copy the master tablet from one of the good masters.
- Start the new master.
- Wait for it to load the master tablet and join consensus. You can use ksck to monitor its progress.
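The steps above can be sketched as a shell script. Everything concrete here is a hypothetical placeholder: the paths, the `kudu-master` systemd service name, the `good-master.example.com` hostname, and the example UUID; substitute your own values. The master tablet is the sys catalog, whose tablet ID is all zeros. The kudu commands are collected in a function for readability; in practice, run them one at a time and verify each step.

```shell
# Hypothetical values -- substitute your own.
WAL_DIR=/var/lib/kudu/master/wal         # hypothetical --fs_wal_dir
DATA_DIR=/var/lib/kudu/master/data       # hypothetical --fs_data_dirs

replace_master() {
  sudo systemctl stop kudu-master                    # 1. stop the new master

  sudo -u kudu rm -rf "$WAL_DIR"/* "$DATA_DIR"/*     # 2. wipe WAL and data dirs

  sudo -u kudu kudu fs format \
      --fs_wal_dir="$WAL_DIR" \
      --fs_data_dirs="$DATA_DIR"                     # 3. re-format

  # 4. Splice the old master's UUID (base64-encoded; see below) into each
  #    FS instance file: one in the WAL dir, one in each data dir.
  sudo -u kudu kudu pbc edit "$WAL_DIR/instance"
  sudo -u kudu kudu pbc edit "$DATA_DIR/instance"

  # 5. Copy the master tablet (the sys catalog; tablet ID is all zeros)
  #    from one of the healthy masters.
  sudo -u kudu kudu remote_replica copy \
      00000000000000000000000000000000 \
      good-master.example.com:7051 localhost:7051

  sudo systemctl start kudu-master                   # 6. restart the master
}

# kudu pbc edit expects the UUID as base64. A Kudu UUID is a 32-character
# hex string, so base64-encode those string bytes directly:
OLD_UUID=4aab798a69e94fab8d77069edff28ce0   # hypothetical old master's UUID
printf '%s' "$OLD_UUID" | base64
```

The base64 conversion at the end encodes the UUID's ASCII characters, not the raw hex bytes, since the instance file stores the UUID as a string.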