There are intermittent test failures in DynamicBrokerReconfigurationTest when brokers are restarted. The test uses ephemeral ports, so the ports after a server restart are not the same as the ports before the restart. The tests rely on metadata refresh in producers, consumers and admin clients to obtain the new server ports when connections fail. This works with producers and consumers, but results in intermittent failures with the admin client because the refresh is not triggered.
There are a couple of issues in AdminClient:
- Unlike producers and consumers, AdminClient does not request a metadata update when the connection to a broker fails. This is particularly bad if the controller goes down. The controller is used for various requests like createTopics and describeTopics. If the controller goes down and adminClient.describeTopics() is invoked, adminClient sends the request to the old controller. If the connection fails, it keeps retrying with the same address, and a metadata refresh is never triggered. The request times out after 2 minutes by default, while metadata is not refreshed for 5 minutes by default, so the call fails before the client ever learns the new address. We should refresh metadata whenever the connection to a broker fails.
- Admin client requests are always retried on the same node. In the example above, if the controller goes down and a new controller is elected, the retried request should be sent to the new controller. Otherwise we are just blocking the call for 2 minutes with a lot of retries that can never succeed.
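
The two changes proposed above (request a metadata update when a connection fails, and re-resolve the target node on every retry) can be illustrated with a minimal, self-contained sketch. This is not the actual AdminClient code: RetryWithMetadataRefresh, MetadataCache, sendWithRetries and the host:port strings are hypothetical names invented for illustration, and the metadata request is simulated by a plain Supplier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical sketch of the proposed retry behaviour; not AdminClient internals. */
public class RetryWithMetadataRefresh {

    /** Hypothetical cache of the last known controller address. */
    static class MetadataCache {
        private final Supplier<String> clusterLookup; // stands in for a real metadata request
        private String controller;
        private boolean updateRequested;

        MetadataCache(String initialController, Supplier<String> clusterLookup) {
            this.controller = initialController;
            this.clusterLookup = clusterLookup;
        }

        /** Marks the cached metadata as stale (fix 1: called on connection failure). */
        void requestUpdate() {
            updateRequested = true;
        }

        /** Returns the controller address, refreshing it first if an update was requested. */
        String controller() {
            if (updateRequested) {
                controller = clusterLookup.get();
                updateRequested = false;
            }
            return controller;
        }
    }

    /**
     * Sends a request to the controller, retrying up to maxAttempts times.
     * Returns the list of addresses that were tried, in order.
     */
    static List<String> sendWithRetries(MetadataCache metadata,
                                        Predicate<String> connect,
                                        int maxAttempts) {
        List<String> attempted = new ArrayList<>();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Fix 2: look up the node on every attempt instead of pinning the first one.
            String node = metadata.controller();
            attempted.add(node);
            if (connect.test(node)) {
                return attempted;
            }
            // Fix 1: a failed connection triggers a metadata update before the next retry.
            metadata.requestUpdate();
        }
        throw new IllegalStateException(
            "Request failed after " + maxAttempts + " attempts: " + attempted);
    }

    public static void main(String[] args) {
        String oldController = "localhost:50001"; // port before the restart (made up)
        String newController = "localhost:50002"; // port after the restart (made up)
        MetadataCache metadata = new MetadataCache(oldController, () -> newController);

        // Only the new controller accepts connections.
        List<String> attempted = sendWithRetries(metadata, newController::equals, 3);
        System.out.println(attempted); // prints [localhost:50001, localhost:50002]
    }
}
```

In this sketch the first attempt goes to the stale address and fails, which marks the metadata for refresh. The second attempt resolves the new controller and succeeds instead of retrying the old address until the request times out.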