I think this is how it can happen. Suppose there's a partition 'P' assigned to brokers x and y; leaderAndIsr = y,
1. Controlled shutdown of broker x; leaderAndIsr -> y,
2. After the above completes, kill -15 and then restart broker x.
3. Immediately do a controlled shutdown of broker y; so now y is in the list of shutting down brokers.
Because y is now in that list, x will not start its follower for 'P' from broker y.
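
To make the ordering problem concrete, here is a rough toy model of the three steps. It is not Kafka code: the `simulate` function, the starting ISR of {x, y}, and the rule that a follower is not started toward a leader in the shutting-down list are all assumptions made for illustration.

```python
def simulate(wait_before_step_3):
    """Toy model of partition 'P' on brokers x and y. Not Kafka code."""
    # Assumed starting state: y leads, both replicas are in the ISR.
    leader, isr = "y", {"x", "y"}
    shutting_down = set()

    # Step 1: controlled shutdown of x; it leaves the ISR, y stays leader.
    shutting_down.add("x")
    isr.discard("x")

    # Step 2: x is killed and restarted, so it is no longer shutting down.
    shutting_down.discard("x")

    def try_start_follower():
        # Assumed rule: no follower is started toward a leader that is
        # in the shutting-down list.
        if leader in shutting_down:
            return False
        isr.add("x")  # the follower catches up and rejoins the ISR
        return True

    if wait_before_step_3:
        follower_started = try_start_follower()

    # Step 3: controlled shutdown of y.
    shutting_down.add("y")

    if not wait_before_step_3:
        follower_started = try_start_follower()

    # Leadership can only move to an ISR member that is not shutting down.
    candidates = sorted(isr - shutting_down)
    if candidates:
        leader = candidates[0]
    return {"follower_started": follower_started, "leader": leader, "isr": sorted(isr)}


print(simulate(wait_before_step_3=False))
print(simulate(wait_before_step_3=True))
```

In this sketch, running step 3 immediately leaves x without a follower and y stuck as leader, while letting x start its follower first lets leadership move to x.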
Adding a sufficient wait between (2) and (3) seems to address the issue (your script has no sleep there), but we should handle it properly in the shutdown code.
Will think about a fix for that.