This approach, in which each replica acts upon a queue of state change requests, is something I'm leaning towards for all state changes.
There are 2 ways of making state changes in a system that uses ZK listeners:
1. Each server listens on various ZK paths, registers the same listeners, and follows the same code path to apply state changes to itself. Here, the state machine is replicated on each server.
2. A highly-available co-ordinator listens on various ZK paths, registers ZK listeners, and verifies system state and state transitions. It then issues state transition requests to the various replicas. Here, only the co-ordinator executes the state machine.
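To make approach 2 concrete, here is a rough in-memory sketch of a co-ordinator that decides transitions and pushes them onto per-replica request queues. It is only an illustration under assumptions: all class, method and state names (`CoordinatorSketch`, `onLeaderFailure`, "LEADER", "FOLLOWER") are hypothetical, and the ZK listener is replaced by a plain method call.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of approach 2. None of these names come from the Kafka code base.
public class CoordinatorSketch {

    // One request from the co-ordinator to a single replica.
    static final class StateChangeRequest {
        final String partition;
        final String targetState;

        StateChangeRequest(String partition, String targetState) {
            this.partition = partition;
            this.targetState = targetState;
        }
    }

    // A replica carries no state machine logic; it only drains its request queue.
    static final class Replica {
        final Queue<StateChangeRequest> requests = new ArrayDeque<>();
        final Map<String, String> partitionStates = new HashMap<>();

        void applyPending() {
            StateChangeRequest request;
            while ((request = requests.poll()) != null) {
                partitionStates.put(request.partition, request.targetState);
            }
        }
    }

    // The only place where state transitions are decided.
    static final class Coordinator {
        final Map<Integer, Replica> replicasByBroker = new HashMap<>();

        // Stand-in for a ZK listener callback; a real one would be driven by ZK watches.
        void onLeaderFailure(int failedBroker, String partition, int newLeader) {
            replicasByBroker.remove(failedBroker);
            for (Map.Entry<Integer, Replica> entry : replicasByBroker.entrySet()) {
                String target = entry.getKey() == newLeader ? "LEADER" : "FOLLOWER";
                entry.getValue().requests.add(new StateChangeRequest(partition, target));
            }
        }
    }

    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator();
        for (int brokerId = 1; brokerId <= 3; brokerId++) {
            coordinator.replicasByBroker.put(brokerId, new Replica());
        }
        coordinator.onLeaderFailure(1, "topic-0", 2);
        for (Map.Entry<Integer, Replica> entry : coordinator.replicasByBroker.entrySet()) {
            entry.getValue().applyPending();
            System.out.println("broker " + entry.getKey() + ": " + entry.getValue().partitionStates);
        }
    }
}
```

In approach 1, the logic inside `onLeaderFailure` would instead run on every server, each reacting to the same ZK event independently.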
We have been down the approach 1 path before with the zookeeper consumer and found through experience that, though it seems simpler to design and implement at first, it turns into a fairly buggy system with high operational overhead. This is because that approach suffers from:
1. herd effect
2. "split brain" problem.
3. In addition to these, upgrading the state machine is pretty complicated and can leave the cluster in an undefined state during the upgrade.
4. Monitoring the state machine is a hassle because it is distributed across servers.
Approach 2 keeps the state machine only on the co-ordinator, which itself is elected from amongst the brokers. This approach ensures that:
1. At any point in time, we can reason about the state of the entire cluster.
2. Further state changes can be applied only after the state is verified. If verification fails, alerts can be triggered, preventing the system from getting into an undefined state.
3. A big advantage of this approach is easier upgrades to the state machine. It is true that, theoretically, state machine logic doesn't change much over time, but in reality the state machine would need upgrades, either to improve the logic or to fix bugs.
4. Monitoring the state machine becomes much simpler
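The verify-then-apply idea in point 2 could look roughly like the sketch below. This is only an assumption about how it might be structured: the states (`NEW`, `ONLINE`, `OFFLINE`), the transition table and the `requestTransition` method are all hypothetical, and "alerting" is reduced to appending to a list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a co-ordinator that verifies state before issuing a transition.
public class VerifyingCoordinatorSketch {

    // Made-up replica states, for illustration only.
    enum ReplicaState { NEW, ONLINE, OFFLINE }

    // Made-up table of legal transitions.
    static final Map<ReplicaState, Set<ReplicaState>> LEGAL = Map.of(
            ReplicaState.NEW, Set.of(ReplicaState.ONLINE),
            ReplicaState.ONLINE, Set.of(ReplicaState.OFFLINE),
            ReplicaState.OFFLINE, Set.of(ReplicaState.ONLINE));

    // The co-ordinator's view of each replica, plus any alerts raised so far.
    final Map<String, ReplicaState> expected = new HashMap<>();
    final List<String> alerts = new ArrayList<>();

    // Issues the transition only if the observed state matches the co-ordinator's
    // view and the transition is legal. Otherwise it raises an alert and changes nothing.
    boolean requestTransition(String replica, ReplicaState observed, ReplicaState target) {
        ReplicaState known = expected.getOrDefault(replica, ReplicaState.NEW);
        if (observed != known) {
            alerts.add(replica + ": expected " + known + " but observed " + observed);
            return false;
        }
        if (!LEGAL.get(known).contains(target)) {
            alerts.add(replica + ": illegal transition " + known + " -> " + target);
            return false;
        }
        expected.put(replica, target);
        return true;
    }

    public static void main(String[] args) {
        VerifyingCoordinatorSketch coordinator = new VerifyingCoordinatorSketch();
        System.out.println(coordinator.requestTransition("replica-1", ReplicaState.NEW, ReplicaState.ONLINE));
        System.out.println(coordinator.requestTransition("replica-1", ReplicaState.OFFLINE, ReplicaState.ONLINE));
        System.out.println(coordinator.alerts);
    }
}
```

Because the check and the alert list live in one process, monitoring amounts to watching a single component rather than every server.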
In general, both approaches are “doable”, but we need to weigh the cost of “patching” the code to make it work against choosing a simple design that will be easy to maintain and monitor.
I would like to see a discussion on this fundamental design choice before jumping to code and patches on KAFKA-44 and KAFKA-45.