A few more changes in this patch -
1. Changed the leader and ISR request to send, for each partition, the epoch of the controller that made the last leader/ISR change. The broker uses this to update the leader and ISR path with the correct controller epoch. Each Partition object on a Kafka server maintains the epoch of the controller that made the last leader/ISR decision; if/when the broker changes the ISR, it uses that value for the controller epoch instead of the currently active controller's epoch. Functionally, nothing bad happens even if it uses the currently active controller's epoch (which is sent on every state change request), but semantically it would not quite be right: a previous controller may have made the leader/ISR decisions for some partitions, while newer controllers merely re-published those decisions upon controller failover. See the first sketch after this list.
2. Changed the become-controller procedure to resign as the current controller if it runs into any unexpected error/exception while making the state change to become controller. This ensures that the currently elected controller is actually serving as the controller. See the second sketch after this list.
3. LogRecoveryTest and LogTest occasionally fail, but I believe they fail on our nightly build as well. I didn't attempt to fix those tests in this patch.
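
A minimal sketch of the per-partition epoch caching from point 1 (class, field, and helper names here are illustrative, not the exact patch):

{code}
// Sketch only: each Partition caches the epoch of the controller that made the
// last leader/ISR decision and reuses it when the broker itself rewrites the
// ISR path, instead of using the active controller's epoch.
class Partition(topic: String, partitionId: Int) {
  @volatile var controllerEpoch: Int = -1   // epoch of the last deciding controller
  @volatile var inSyncReplicas: Set[Int] = Set.empty

  // Called for every leader/ISR state change request.
  def makeLeaderOrFollower(epochOfDecidingController: Int, newIsr: Set[Int]): Unit = {
    controllerEpoch = epochOfDecidingController
    inSyncReplicas = newIsr
  }

  // When the leader shrinks the ISR, it writes back the cached epoch, not the
  // epoch of whichever controller happens to be active right now.
  def shrinkIsr(newIsr: Set[Int]): Unit = {
    inSyncReplicas = newIsr
    // writeLeaderAndIsrToZk(topic, partitionId, newIsr, controllerEpoch)  // hypothetical helper
  }
}
{code}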
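And a sketch of the resign-on-failure behavior from point 2 (all method names are hypothetical stand-ins for the real steps):

{code}
// Sketch only: if any step of becoming controller fails, stop claiming to be
// the controller so that re-election can pick a broker that can actually serve.
class KafkaController {
  def onBecomingController(): Unit = {
    try {
      incrementControllerEpoch()     // conditional ZK update of the epoch
      initializeControllerContext()  // read topics/partitions/replicas from ZK
      sendLeaderAndIsrRequests()     // publish leader/ISR decisions to brokers
    } catch {
      case e: Throwable =>
        // Give up the controller path so another broker can be elected, then rethrow.
        resignAsController()
        throw e
    }
  }

  private def incrementControllerEpoch(): Unit = ()
  private def initializeControllerContext(): Unit = ()
  private def sendLeaderAndIsrRequests(): Unit = ()
  private def resignAsController(): Unit = ()
}
{code}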
Regarding Jun's review -
>> 40. PartitionStateInfo: It seems that we need to send the controllerEpoch associated with this partition. Note that this epoch is different from the controllerEpoch in LeaderAndIsrRequest. The former is the epoch of the controller that last changed the leader or ISR and will be used when the broker updates the ISR. The latter is the epoch of the controller that sends the request and will be used in ReplicaManager to decide which controller's decision to follow. We will need to change the controllerEpoch passed to makeLeader and makeFollower in ReplicaManager accordingly.
You raise a good point here. What I missed is initializing the controller epoch for each partition. There are two ways to initialize it: (1) a ZooKeeper read on startup, or (2) having the active controller send the epoch of the controller that last made a leader/ISR decision for that partition. I'm guessing (2) might be better from a performance perspective; a sketch of that option follows.
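
A rough sketch of option 2, where the wire format carries both epochs (the field layout is illustrative, not the exact patch):

{code}
// Sketch only: the request-level epoch identifies the sending controller; the
// per-partition epoch identifies the controller that last changed that
// partition's leader/ISR (possibly an older one, re-published on failover).
case class LeaderIsrAndControllerEpoch(leader: Int, isr: List[Int], controllerEpoch: Int)

case class PartitionStateInfo(leaderIsrAndControllerEpoch: LeaderIsrAndControllerEpoch,
                              replicationFactor: Int)

case class LeaderAndIsrRequest(controllerEpoch: Int,  // epoch of the sending controller
                               partitionStateInfos: Map[(String, Int), PartitionStateInfo])
{code}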
>> 41. ReplicaManager: In stopReplicas() and becomeLeaderOrFollower(), it would be better to only update controllerEpoch when it's truly necessary, i.e., when the new controllerEpoch is larger than the cached one (not equal). This is because updating a volatile variable is a bit more expensive than updating a local variable, since the update has to be exposed to other threads.
I'm not sure this is a performance win. Volatile variables are never cached in registers, so a read reloads the value from main memory and a write flushes it back to main memory. The if statement would need to read the same volatile variable, requiring a trip to memory anyway.
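
To illustrate (field and method names here are illustrative):

{code}
class ReplicaManager {
  @volatile private var controllerEpoch: Int = 0

  def maybeUpdateControllerEpoch(requestEpoch: Int): Unit = {
    // The guard itself is a volatile read, which already goes to memory; the
    // guard only avoids the volatile write when the epochs are equal.
    if (requestEpoch > controllerEpoch)
      controllerEpoch = requestEpoch
  }
}
{code}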
>> 42.1. incrementControllerEpoch(): If the controllerEpoch path doesn't exist, we create the path using the initial epoch version without using conditional update. It is possible for 2 controllers to execute this logic simultaneously and both get the initial epoch version. One solution is to make sure the controller epoch path exists during context initialization. Then we can always use conditional update here.
It is not possible for two clients to create the same ZooKeeper path; this is the simplest guarantee ZooKeeper provides. One of the creates will fail, and that controller will abort its controller startup procedure. The larger problem is not so much that one of the writes should fail, but that we need to ensure that if the failed ZK operation happens to be for the latest active controller, it will abort its controller startup procedure; the old controller will lose its ZooKeeper session anyway.
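
For illustration (the path and abort handling are illustrative), the initial create can simply rely on ZooKeeper rejecting the second create:

{code}
import org.I0Itec.zkclient.ZkClient
import org.I0Itec.zkclient.exception.ZkNodeExistsException

object ControllerEpochInit {
  def initializeControllerEpoch(zkClient: ZkClient): Int = {
    try {
      // ZooKeeper guarantees at most one create of a given path succeeds, so
      // no conditional update is needed for the initial value.
      zkClient.createPersistent("/controller_epoch", "1")
      1 // this controller owns the initial epoch
    } catch {
      case e: ZkNodeExistsException =>
        // Another controller created the path first; abort controller startup.
        throw new IllegalStateException("Lost the race to initialize the controller epoch", e)
    }
  }
}
{code}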
>> 42.2. ControllerContext: We need to initialize controllerEpoch by reading from ZK. We also need to make sure that we subscribe to the controllerEpoch path first and then read its value from ZK for initialization.
The controller constructor is modified to initialize the controller epoch and its ZK version by reading from ZK, and it then subscribes to the controller epoch's ZK path, roughly as sketched below.
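
A sketch of that constructor-time initialization (names and path are illustrative):

{code}
import org.I0Itec.zkclient.ZkClient
import org.apache.zookeeper.data.Stat

class ControllerContext(zkClient: ZkClient) {
  @volatile var controllerEpoch: Int = 0
  @volatile var controllerEpochZkVersion: Int = 0

  // Read the epoch and its ZK version from the same read, then subscribe so
  // later changes are observed by the listener.
  {
    val stat = new Stat()
    val data: String = zkClient.readData("/controller_epoch", stat)
    controllerEpoch = data.toInt
    controllerEpochZkVersion = stat.getVersion
    // zkClient.subscribeDataChanges("/controller_epoch", controllerEpochListener)
  }
}
{code}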
>> 42.3. ControllerEpochListener: It's safer to set both the epoch and the ZK version using the value from ZkUtils.readData.
You're right, and there is no perfect solution to this. Ideally, the zkclient API should change to expose the version in the listener callback, since the underlying ZooKeeper API exposes it. The problem is that there will always be a window, after the listener has fired and before the read returns, during which the controller's epoch could change. That change will fire another listener, but the same problem exists during each listener invocation. The right way would be to rely on the data that the listener hands to the controller. Still, with this change, at least the epoch and its version will correspond to the same epoch change, so it's still better.
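
Here is the shape of that change (a sketch assuming the ControllerContext fields above; names are illustrative):

{code}
import org.I0Itec.zkclient.{IZkDataListener, ZkClient}
import org.apache.zookeeper.data.Stat

class ControllerEpochListener(zkClient: ZkClient, context: ControllerContext)
  extends IZkDataListener {

  // Re-read the path instead of trusting the possibly stale callback data, so
  // the epoch and its ZK version come from the same read.
  override def handleDataChange(dataPath: String, data: Object): Unit = {
    val stat = new Stat()
    val latest: String = zkClient.readData(dataPath, stat)
    context.controllerEpoch = latest.toInt
    context.controllerEpochZkVersion = stat.getVersion
  }

  override def handleDataDeleted(dataPath: String): Unit = {
    // The controller epoch path is persistent; deletion is not expected.
  }
}
{code}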