We ran into this issue on our production cluster and had to manually remove the broker and enable unclean leader elections to get the cluster working again. Ideally, Kafka itself could handle network partitions without manual intervention.
The issue is reproducible with the following cross-datacenter Kafka cluster setup:
DC 1: Kafka brokers + ZK nodes
DC 2: Kafka brokers + ZK nodes
DC 3: Kafka brokers + ZK nodes
Introduce a network partition on a Kafka broker (brokerA) in DC 1 so that it cannot reach any hosts (brokers or ZK nodes) in the other 2 datacenters. The cluster goes into a state where every partition that brokerA leads has only brokerA in its ISR. Since brokerA can still reach the ZK nodes in DC 1, it still shows up when querying ZK. The controller therefore thinks brokerA is still up and does not elect new leaders for the partitions that brokerA leads. All of those partitions stay down until brokerA is back or is completely removed from the cluster (in which case unclean leader election can elect new leaders for those partitions).
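The failure state described above can be sketched as a toy model. This is only an illustration, not Kafka or ZooKeeper code: the host names, the one-replica-per-DC placement, and the helpers `can_reach` and `simulate` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    dc: int
    kind: str  # "broker" or "zk"


def can_reach(a, b, isolated):
    """Toy reachability: the isolated host can only talk to hosts in its own DC."""
    if a.name == isolated or b.name == isolated:
        return a.dc == b.dc
    return True


def simulate(hosts, leader, replicas, isolated):
    """Return (session_alive, isr, election_triggered) for one partition."""
    by_name = {h.name: h for h in hosts}
    zk_nodes = [h for h in hosts if h.kind == "zk"]
    # Assumption: the leader's ZK session survives if it reaches any ZK node.
    session_alive = any(can_reach(by_name[leader], z, isolated) for z in zk_nodes)
    # Assumption: a follower stays in the ISR only if it can fetch from the leader.
    isr = [r for r in replicas
           if r == leader or can_reach(by_name[r], by_name[leader], isolated)]
    # Assumption: the controller elects a new leader only when the session is gone.
    election_triggered = not session_alive
    return session_alive, isr, election_triggered


# Hypothetical layout: one broker and one ZK node per datacenter.
HOSTS = [
    Host("brokerA", 1, "broker"), Host("zk1", 1, "zk"),
    Host("brokerB", 2, "broker"), Host("zk2", 2, "zk"),
    Host("brokerC", 3, "broker"), Host("zk3", 3, "zk"),
]

result = simulate(HOSTS, "brokerA", ["brokerA", "brokerB", "brokerC"], "brokerA")
print(result)
```

In this model the session stays alive through zk1, the ISR shrinks to brokerA alone, and no election is triggered, which matches the stuck state reported here.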
A faster recovery could be for a majority of hosts (ZK nodes?) to detect that brokerA is unreachable and mark it as down, so that leader elections for the partitions it leads can be triggered. This would avoid waiting indefinitely for the broker to come back or manually removing it from the cluster.
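A minimal sketch of the majority idea, assuming each observer (for example a ZK node) could report whether it can reach the broker. No such probing exists in Kafka or ZooKeeper as far as this report describes; `is_broker_down` and the observation map are hypothetical names used only for illustration.

```python
def is_broker_down(observations):
    """Return True if a strict majority of observers cannot reach the broker.

    observations maps an observer name to True (reachable) or False (unreachable).
    """
    if not observations:
        return False  # no evidence, so do not mark the broker as down
    unreachable = sum(1 for reachable in observations.values() if not reachable)
    needed = len(observations) // 2 + 1  # strict majority
    return unreachable >= needed


# Hypothetical view of brokerA during the partition: only zk1 (same DC) sees it.
views = {"zk1": True, "zk2": False, "zk3": False}
print(is_broker_down(views))
```

With two of three observers unable to reach brokerA, the sketch marks it as down, which is the point at which elections for its partitions could be triggered. Requiring a strict majority keeps a single observer with a bad link from taking a healthy broker out of service.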