Details
-
Bug
-
Status: Resolved
-
Urgent
-
Resolution: Fixed
-
None
-
None
-
Critical
Description
I believe the following was observed on 0.7.2. I meant to dig deeper, but never had the time, and now I want to at least file this even if I don't have extremely helpful information.
A piece of background is that we consciously made the decision to have the default configuration on nodes have auto_bootstrap set to true. The logic was that if one accidentally were to start a new node, we'd rather have it join with data than join without data and cause bogus read results in the cluster.
We executed this policy (by way of having the puppet managed config have auto_bootstrap set to true).
On one of our clusters with 5 nodes, we did some moves. All looked well; the moves completed. For unrelated reasons, we wanted to restart nodes after they had been moved. When we did, three of the 5, specifically those 3 that were NOT seed nodes, initiated a bootstrap procedure! Before the moves the cluster had been running for several days at least.
The logs indicated the automatic token selection, and they joined the ring under a new automatically selected token.
Presumably, this violated consistency but at the time there was no live traffic to the cluster and we didn't confirm (put traffic on it after repair+cleanup).
I did look a little bit at the code in light of this but didn't see anything obvious, so I don't really know what the likely culprit is.
A potential complication was that seed nodes were moved without using the correct procedure of de-seeding them first. This was clearly wrong, but it is not obvious to me that it would cause other nodes to incorrectly bootstrap since a node should never bootstrap more than once if the local system tables say it's been bootstrapped.