TL;DR - after a topic is created and at least one broker in its ISR is restarted, the ISR reported in the TopicMetadataResponse is incorrect.
Specific steps to repro:
- Download 0.8.1 Kafka
- Copy server.properties twice into server1.properties and server2.properties (attached) - basically just ports and log paths changed to allow brokers to co-exist
- Start ZooKeeper using "sh bin/zookeeper-server-start.sh config/zookeeper.properties"
- Start broker1: "sh bin/kafka-server-start.sh config/server1.properties"
- Start broker2: "sh bin/kafka-server-start.sh config/server2.properties"
- Create a new topic: "sh bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic test --replication-factor 2 --partitions 3"
- Examine topic state: "sh bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic test" - note that all ISRs are of length 2
- Run the attached Scala code that uses TopicMetadataRequest to examine topic state. Observe that all ISRs are of length 2 and match the information output by the script
- Shut down broker2 (simply hit Ctrl-C in the terminal), wait 5-10 seconds
- Restart broker 2 using the original command
- Check the status of the topic again with the --describe command. Observe that the leader for all partitions is 0 (as expected), and all ISRs contain both brokers (as expected)
- Run the attached Scala snippet again.
- The ISR information are of length 2
- ALL ISRs contain just broker 0
NOTE: depending on how long broker2 was down, some ISRs will sometimes contain the full list, but shutting it down for 15+ seconds seems to reproduce the problem consistently.
Basically, it appears that the brokers hold incorrect ISR information in their metadata cache.
Our production servers exhibit the same problem. After a topic is created everything looks fine, but as brokers get restarted, the ISR reported by the brokers becomes wrong, whereas the one in ZK appears to report the truth (it shrinks as brokers are shut down and grows back after they are restarted).
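
To make the mismatch easy to spot, the two views can be diffed per partition. Below is a minimal Python sketch, not the attached Scala code: it parses the partition lines printed by kafka-topics.sh --describe (the ZK view) and compares them with a dict of ISRs taken from the TopicMetadataResponse. The function names, the line format the regex expects, and the sample data are all assumptions for illustration; the sample is not captured output.

```python
import re

# Partition lines from `kafka-topics.sh --describe` are assumed to look like:
#   Topic: test  Partition: 0  Leader: 0  Replicas: 0,1  Isr: 0,1
# Adjust the pattern if your Kafka version prints something different.
_PARTITION_LINE = re.compile(
    r"Partition:\s*(\d+)\s+Leader:\s*(-?\d+)\s+Replicas:\s*([\d,]*)\s+Isr:\s*([\d,]*)"
)


def parse_describe(text):
    """Return {partition_id: sorted ISR broker ids} from --describe output."""
    isr_by_partition = {}
    for line in text.splitlines():
        match = _PARTITION_LINE.search(line)
        if match:
            isr = [int(b) for b in match.group(4).split(",") if b]
            isr_by_partition[int(match.group(1))] = sorted(isr)
    return isr_by_partition


def isr_mismatches(zk_isr, metadata_isr):
    """Return {partition: (zk_isr, metadata_isr)} for partitions that differ."""
    mismatches = {}
    for partition in sorted(set(zk_isr) | set(metadata_isr)):
        zk = sorted(zk_isr.get(partition, []))
        md = sorted(metadata_isr.get(partition, []))
        if zk != md:
            mismatches[partition] = (zk, md)
    return mismatches


# Hypothetical sample: what --describe might print after broker 1 rejoins.
SAMPLE_DESCRIBE = (
    "Topic:test\tPartitionCount:3\tReplicationFactor:2\tConfigs:\n"
    "\tTopic: test\tPartition: 0\tLeader: 0\tReplicas: 0,1\tIsr: 0,1\n"
    "\tTopic: test\tPartition: 1\tLeader: 0\tReplicas: 1,0\tIsr: 0,1\n"
    "\tTopic: test\tPartition: 2\tLeader: 0\tReplicas: 0,1\tIsr: 0,1\n"
)

# Hypothetical ISRs as read from a TopicMetadataResponse (only broker 0).
SAMPLE_METADATA_ISR = {0: [0], 1: [0], 2: [0]}

if __name__ == "__main__":
    zk_view = parse_describe(SAMPLE_DESCRIBE)
    for partition, (zk, md) in isr_mismatches(zk_view, SAMPLE_METADATA_ISR).items():
        print("partition %d: ZK ISR=%s, metadata ISR=%s" % (partition, zk, md))
```

In practice the describe text would come from the command in the repro steps, and the metadata dict would be filled from whatever client issues the TopicMetadataRequest. An empty result means the two views agree.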
I'm not sure whether this has a wider impact on the functioning of the cluster. Bad metadata is bad, but so far there has been no evidence of further problems.