  1. Kafka
  2. KAFKA-15230

ApiVersions data between controllers is not reliable



    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate


      While testing ZK migrations, I noticed a case where the controller was not starting the migration because it was missing ApiVersions data from the other controllers. This was unexpected because the quorum was running and the followers were replicating the metadata log as expected. A heap dump of the leader confirmed that the ApiVersions map of NodeApiVersions was in fact empty.


      After further investigation and offline discussion with jsancio, we realized that after the initial leader election, the connections from the Raft leader to the followers become idle and eventually time out and close. This causes NetworkClient to purge the NodeApiVersions data for the closed connections.
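
      The purge-on-idle behavior described above can be illustrated with a toy model. This is only a sketch and not Kafka's NetworkClient code: the class IdlePurgeSketch, its methods, and the string used to stand in for NodeApiVersions are all hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only; not Kafka's NetworkClient. All names are hypothetical. */
public class IdlePurgeSketch {
    private final long maxIdleMs;
    // nodeId -> timestamp of the last activity seen on that connection
    private final Map<Integer, Long> lastActiveMs = new HashMap<>();
    // nodeId -> API versions learned when the connection was established
    private final Map<Integer, String> nodeApiVersions = new HashMap<>();

    public IdlePurgeSketch(long maxIdleMs) {
        this.maxIdleMs = maxIdleMs;
    }

    /** Records a new connection together with the versions the peer advertised. */
    public void connect(int nodeId, String versions, long nowMs) {
        lastActiveMs.put(nodeId, nowMs);
        nodeApiVersions.put(nodeId, versions);
    }

    /** Closes connections idle for longer than maxIdleMs and drops their version data. */
    public void poll(long nowMs) {
        lastActiveMs.entrySet().removeIf(entry -> {
            boolean idle = nowMs - entry.getValue() > maxIdleMs;
            if (idle) {
                // The version data goes away with the connection.
                nodeApiVersions.remove(entry.getKey());
            }
            return idle;
        });
    }

    public int knownNodes() {
        return nodeApiVersions.size();
    }

    public static void main(String[] args) {
        IdlePurgeSketch leader = new IdlePurgeSketch(600_000L); // 10 minutes
        leader.connect(2, "follower-2-versions", 0L);
        leader.connect(3, "follower-3-versions", 0L);

        leader.poll(300_000L);
        System.out.println(leader.knownNodes()); // prints 2

        leader.poll(600_001L);
        System.out.println(leader.knownNodes()); // prints 0
    }
}
```

      In this model nothing ever repopulates the map once the connections expire, which matches the empty map seen in the heap dump.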


      There are two main side effects of this behavior: 

      1) If a migration is not started within the idle timeout period (10 minutes by default), it cannot be started at all. Once this period had elapsed, I was unable to restart the controllers in such a way that the leader had active connections with all followers.

      2) Dynamically updating features, such as "metadata.version", is not guaranteed to be safe.
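
      Both side effects come down to a check that needs version data for every other controller. A minimal sketch of such a check follows, assuming the leader simply looks up each voter in its map; the class MigrationReadinessSketch and its method names are hypothetical and are not taken from the Kafka code base.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only; the names below are hypothetical, not Kafka APIs. */
public class MigrationReadinessSketch {

    /** Returns the ids of voters, other than the leader, that have no version data. */
    static Set<Integer> votersMissingApiVersions(int leaderId,
                                                 Set<Integer> voters,
                                                 Map<Integer, String> knownApiVersions) {
        Set<Integer> missing = new TreeSet<>();
        for (int voter : voters) {
            if (voter != leaderId && !knownApiVersions.containsKey(voter)) {
                missing.add(voter);
            }
        }
        return missing;
    }

    /** The operation may proceed only when every other voter is accounted for. */
    static boolean canProceed(int leaderId,
                              Set<Integer> voters,
                              Map<Integer, String> knownApiVersions) {
        return votersMissingApiVersions(leaderId, voters, knownApiVersions).isEmpty();
    }

    public static void main(String[] args) {
        Set<Integer> voters = Set.of(1, 2, 3);

        // Right after leader election: both followers are connected.
        Map<Integer, String> fresh = Map.of(2, "v", 3, "v");
        System.out.println(canProceed(1, voters, fresh)); // prints true

        // After the idle timeout: the map has been purged.
        Map<Integer, String> purged = Map.of();
        System.out.println(votersMissingApiVersions(1, voters, purged)); // prints [2, 3]
        System.out.println(canProceed(1, voters, purged)); // prints false
    }
}
```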


      There is a partial workaround for the migration issue. If we set "connections.max.idle.ms" to -1, the Raft leader will never disconnect from the followers. However, if a follower restarts, the leader will not re-establish a connection.
      The feature update issue has no safe workarounds.
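
      For reference, the partial workaround amounts to a single configuration override. The sketch below builds it as a java.util.Properties object; the class name is hypothetical, and the issue does not say exactly where the override has to be applied, so applying it to the controller configuration is an assumption.

```java
import java.util.Properties;

/** Hypothetical helper showing the override from the workaround above. */
public class IdleTimeoutWorkaroundSketch {

    static Properties controllerOverrides() {
        Properties props = new Properties();
        // -1 disables the idle timeout, so the leader keeps its connections open.
        // Assumption: this is set in the controller configuration.
        props.setProperty("connections.max.idle.ms", "-1");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(controllerOverrides().getProperty("connections.max.idle.ms"));
    }
}
```

      This does not help after a follower restart, since the leader does not reconnect in that case.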





              cmccabe Colin McCabe
              davidarthur David Arthur