Description
As the source code shows (ZkSecurityMigrator.scala#L226), ZkSecurityMigrator checks and sets ACLs recursively for each path in SecureRootPaths, and /controller is one of those paths.
As a result, zkClient.makeSurePersistentPathExists() creates the /controller node if it does not already exist.
/controller is meant to be an EPHEMERAL node used for controller election, but makeSurePersistentPathExists() creates a PERSISTENT node with null data.
When that happens, the null data causes an NPE, no controller can be elected, and the Kafka cluster becomes unavailable.
In addition, a PERSISTENT node does not disappear automatically, so it has to be deleted manually to fix the problem.
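The failure mode can be illustrated with a small in-memory model. This is only a sketch: ToyZk, Node, createEphemeral, closeSession and decode are hypothetical names invented for illustration, not the ZooKeeper client API or Kafka's KafkaZkClient, and the session id 42 is arbitrary. It assumes, as described above, that the migrator runs before any broker has created /controller.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Toy in-memory stand-in for a ZooKeeper tree. Hypothetical names throughout;
// this is not the ZooKeeper client API or Kafka's KafkaZkClient.
final class ToyZk {
    enum Mode { PERSISTENT, EPHEMERAL }

    static final class Node {
        final byte[] data;          // null models a znode created without data
        final Mode mode;
        final long ephemeralOwner;  // 0 for PERSISTENT nodes, the session id otherwise

        Node(byte[] data, Mode mode, long ephemeralOwner) {
            this.data = data;
            this.mode = mode;
            this.ephemeralOwner = ephemeralOwner;
        }
    }

    private final Map<String, Node> nodes = new HashMap<>();

    Node get(String path) { return nodes.get(path); }

    // Models the behaviour described in this report: if the path is missing,
    // create it as a PERSISTENT node with null data.
    void makeSurePersistentPathExists(String path) {
        nodes.putIfAbsent(path, new Node(null, Mode.PERSISTENT, 0L));
    }

    // Models a broker claiming the controller role; returns false if the path is taken.
    boolean createEphemeral(String path, byte[] data, long sessionId) {
        return nodes.putIfAbsent(path, new Node(data, Mode.EPHEMERAL, sessionId)) == null;
    }

    // When a session ends, only EPHEMERAL nodes owned by that session are removed.
    void closeSession(long sessionId) {
        nodes.values().removeIf(n -> n.mode == Mode.EPHEMERAL && n.ephemeralOwner == sessionId);
    }

    // Stand-in for decoding the controller znode: null input is rejected with an NPE.
    static String decode(byte[] data) {
        return new String(Objects.requireNonNull(data, "znode data is null"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ToyZk zk = new ToyZk();
        zk.makeSurePersistentPathExists("/controller"); // migrator runs before any election

        byte[] payload = "{\"brokerid\":1001}".getBytes(StandardCharsets.UTF_8);
        System.out.println("elected = " + zk.createEphemeral("/controller", payload, 42L));

        zk.closeSession(42L); // the stray node is PERSISTENT, so it survives
        System.out.println("still present = " + (zk.get("/controller") != null));

        try {
            decode(zk.get("/controller").data);
        } catch (NullPointerException e) {
            System.out.println("NPE while decoding: " + e.getMessage());
        }
    }
}
```

In this model the election attempt fails because the path is already taken, the node outlives every session, and any reader that decodes its data hits a null, which mirrors the three symptoms listed above.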
PERSISTENT /controller node with null data in zk:
[zk: localhost:2181(CONNECTED) 16] get /kafka/controller
null
cZxid = 0x1100002284
ctime = Tue Dec 03 18:37:26 CST 2019
mZxid = 0x1100002284
mtime = Tue Dec 03 18:37:26 CST 2019
pZxid = 0x1100002284
cversion = 0
dataVersion = 0
aclVersion = 1
ephemeralOwner = 0x0
dataLength = 0
numChildren = 0
Normal /controller node in zk:
[zk: localhost:2181(CONNECTED) 21] get /kafka/controller
{"version":1,"brokerid":1001,"timestamp":"1575370170528"}
cZxid = 0x11000023e1
ctime = Tue Dec 03 18:49:30 CST 2019
mZxid = 0x11000023e1
mtime = Tue Dec 03 18:49:30 CST 2019
pZxid = 0x11000023e1
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x16ecb572df50021
dataLength = 57
numChildren = 0
NPE in controller.log:
[2019-11-21 15:02:41,276] INFO [ControllerEventThread controllerId=1002] Starting (kafka.controller.ControllerEventManager$ControllerEventThread)
[2019-11-21 15:02:41,282] ERROR [ControllerEventThread controllerId=1002] Error processing event Startup (kafka.controller.ControllerEventManager$ControllerEventThread)
java.lang.NullPointerException
    at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:857)
    at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2572)
    at kafka.utils.Json$.parseBytes(Json.scala:62)
    at kafka.zk.ControllerZNode$.decode(ZkData.scala:56)
    at kafka.zk.KafkaZkClient.getControllerId(KafkaZkClient.scala:902)
    at kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1199)
    at kafka.controller.KafkaController$Startup$.process(KafkaController.scala:1148)
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:86)
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:86)
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:86)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:85)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
So I have submitted a PR in which ZkSecurityMigrator skips the /controller node when /controller does not exist.
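The idea behind that guard can be sketched as follows. This is a hypothetical illustration, not the code in the PR: MigratorSketch, EPHEMERAL_ROOTS, secure and the action strings are invented names, the ZooKeeper tree is modelled as a plain set of paths, and "/brokers" is used only as an example of another root path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the guard; names are illustrative, not Kafka's actual code.
final class MigratorSketch {
    // Root paths that brokers create as EPHEMERAL nodes. Only /controller is named in this report.
    static final Set<String> EPHEMERAL_ROOTS = Collections.singleton("/controller");

    private final Set<String> existing;              // toy stand-in for the ZooKeeper tree
    final List<String> actions = new ArrayList<>();  // what the sketch did, in order

    MigratorSketch(Set<String> existingPaths) {
        this.existing = new HashSet<>(existingPaths);
    }

    boolean exists(String path) { return existing.contains(path); }

    void secure(List<String> rootPaths) {
        for (String path : rootPaths) {
            // Guard: never create a node that is supposed to be EPHEMERAL.
            if (EPHEMERAL_ROOTS.contains(path) && !existing.contains(path)) {
                actions.add("skip " + path);
                continue;
            }
            // Other roots keep the old behaviour: create if missing, then set the ACL.
            if (existing.add(path)) {
                actions.add("create persistent " + path);
            }
            actions.add("set acl " + path);
        }
    }

    public static void main(String[] args) {
        List<String> roots = Arrays.asList("/brokers", "/controller");

        MigratorSketch noController = new MigratorSketch(new HashSet<>());
        noController.secure(roots);
        System.out.println("no controller yet: " + noController.actions);

        MigratorSketch withController =
                new MigratorSketch(new HashSet<>(Collections.singleton("/controller")));
        withController.secure(roots);
        System.out.println("controller elected: " + withController.actions);
    }
}
```

With this guard an existing /controller node still gets its ACL set, while a missing one is left for the brokers to create as an EPHEMERAL node during election.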
This bug seems to affect all versions, so please review and merge the PR as soon as possible.
Thanks!