Affects Version/s: None
Fix Version/s: 2.1.0
The Kafka controller may fail to function properly (even after repeated controller movement) due to the following sequence of events:
- User requests topic deletion
- Controller A deletes the partition znode
- Controller B becomes controller and reads the topic znode
- Controller A deletes the topic znode and removes the topic from the topic deletion znode
- Controller B reads the partition znode and topic deletion znode
- According to controller B's context, the topic znode exists, the topic is not listed for deletion, and some partitions are missing for the topic. Controller B will therefore create the topic znode with empty data (i.e. an empty partition assignment) and create the partition znodes.
- Every controller after controller B will fail because there is no data in the topic znode.
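The interleaving above can be sketched with a plain in-memory map standing in for ZooKeeper; the paths and the "repair" step are illustrative simplifications of what the controller does, not Kafka's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Toy reproduction of the race: a HashMap plays the role of ZooKeeper.
class ControllerRaceSketch {
    static final String TOPIC_ZNODE = "/brokers/topics/t";
    static final String PARTITION_ZNODE = "/brokers/topics/t/partitions/0";
    static final String DELETION_ZNODE = "/admin/delete_topics/t";

    static Map<String, String> runRace() {
        Map<String, String> zk = new HashMap<>();
        zk.put(TOPIC_ZNODE, "{assignment}");
        zk.put(PARTITION_ZNODE, "{state}");
        zk.put(DELETION_ZNODE, "");

        // Controller A deletes the partition znode.
        zk.remove(PARTITION_ZNODE);

        // Controller B becomes controller and reads the topic znode: present.
        boolean bSawTopic = zk.containsKey(TOPIC_ZNODE);

        // Controller A deletes the topic znode and the deletion marker.
        zk.remove(TOPIC_ZNODE);
        zk.remove(DELETION_ZNODE);

        // Controller B now reads the partition and deletion znodes: both gone.
        boolean bSawDeletion = zk.containsKey(DELETION_ZNODE);
        boolean bSawPartition = zk.containsKey(PARTITION_ZNODE);

        // B concludes the topic exists, is not being deleted, and is missing
        // partitions, so it "repairs" the topic znode with empty data.
        if (bSawTopic && !bSawDeletion && !bSawPartition) {
            zk.put(TOPIC_ZNODE, "");          // empty partition assignment
            zk.put(PARTITION_ZNODE, "{state}");
        }
        return zk;
    }

    public static void main(String[] args) {
        // Every later controller reads an empty topic znode and fails.
        System.out.println("topic znode data: '"
                + runRace().get(TOPIC_ZNODE) + "'");
    }
}
```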
The long-term solution is to have a way to prevent an old controller from writing to ZooKeeper when it is no longer the active controller. One idea is to use the ZooKeeper multi API (see https://zookeeper.apache.org/doc/r3.4.3/api/org/apache/zookeeper/ZooKeeper.html#multi(java.lang.Iterable)) so that a controller writes to ZooKeeper only if the zk version of the controller epoch znode has not changed.
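A minimal sketch of that fencing idea, using a toy versioned store instead of a real ZooKeeper client (with the real client, the same shape would be a single `ZooKeeper.multi(...)` call whose first op is `Op.check("/controller_epoch", expectedVersion)`; everything below is an illustrative assumption, not Kafka's implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Toy versioned store: every write bumps the znode's version, and a
// multi-write is applied only if the controller epoch version still matches.
class FencedWriteSketch {
    final Map<String, String> data = new HashMap<>();
    final Map<String, Integer> versions = new HashMap<>();

    void set(String path, String value) {
        data.put(path, value);
        versions.merge(path, 1, Integer::sum);
    }

    // Apply all writes atomically only if the epoch znode's version still
    // matches what this controller observed when it was elected; otherwise
    // another controller has taken over and the whole batch is rejected.
    boolean fencedMulti(int expectedEpochVersion, Map<String, String> writes) {
        if (versions.getOrDefault("/controller_epoch", 0) != expectedEpochVersion) {
            return false;
        }
        writes.forEach(this::set);
        return true;
    }

    public static void main(String[] args) {
        FencedWriteSketch zk = new FencedWriteSketch();
        zk.set("/controller_epoch", "1");                  // A is elected
        int epochVersionSeenByA = zk.versions.get("/controller_epoch");

        zk.set("/controller_epoch", "2");                  // B takes over

        // Controller A's late write fails instead of corrupting state.
        boolean applied = zk.fencedMulti(epochVersionSeenByA,
                Map.of("/brokers/topics/t", ""));
        System.out.println("stale write applied: " + applied);
    }
}
```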
The short-term solution is to have the controller read the topic deletion znode first. If the topic is still listed in the topic deletion znode, the new controller will properly handle the partition states of this topic without creating partition znodes for it. And if the topic is not listed in the topic deletion znode, then both the topic znode and the partition znodes of this topic should already have been deleted by the time the new controller tries to read them.
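The read-ordering fix can be sketched as a decision function over simplified inputs (the method name, enum, and parameters are hypothetical; the point is only that the deletion list is consulted before the topic snapshot is trusted):

```java
import java.util.Set;

// Sketch: the new controller reads the deletion list FIRST, then decides
// what to do with each topic it finds in its zk snapshot.
class StartupReadOrderSketch {
    enum Action { RESUME_DELETION, RECREATE_MISSING_PARTITIONS, IGNORE }

    static Action onStartup(String topic,
                            Set<String> topicsListedForDeletion, // read first
                            boolean topicZnodeExists,            // read after
                            boolean partitionZnodesExist) {
        if (topicsListedForDeletion.contains(topic)) {
            // Deletion is in flight: handle partition states without
            // recreating partition znodes for this topic.
            return Action.RESUME_DELETION;
        }
        if (!topicZnodeExists) {
            // Not listed for deletion and no topic znode: deletion completed
            // before our snapshot, so there is nothing to repair.
            return Action.IGNORE;
        }
        // Only a topic that was never mid-deletion reaches the repair path.
        return partitionZnodesExist
                ? Action.IGNORE
                : Action.RECREATE_MISSING_PARTITIONS;
    }

    public static void main(String[] args) {
        // The racy scenario: topic still listed for deletion is never repaired.
        System.out.println(onStartup("t", Set.of("t"), true, false));
    }
}
```

With this ordering, the dangerous state from the bug (topic znode present, partitions missing, deletion marker gone) can no longer be misread as a topic needing repair, because a topic absent from the deletion list has either never been deleted or is already fully gone.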