Kafka / KAFKA-15353

Empty ISR returned from controller after AlterPartition request

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.5.0, 3.5.1
    • Fix Version/s: 3.6.0, 3.5.2
    • Component/s: core
    • Labels: None

    Description

      In KIP-903 (more specifically, in the PR that implemented it), we bumped the AlterPartitionRequest version to 3 so that it uses the `NewIsrWithEpochs` field instead of the `NewIsr` field. When building the request for an older version, we manually downgrade it for backward compatibility: we extract the ISR info from `NewIsrWithEpochs`, fill in the `NewIsr` field, and then clear the `NewIsrWithEpochs` field.
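
      The downgrade step described above can be sketched roughly as follows. This is a simplified Python model, not the actual Kafka code: `BrokerState`, `PartitionData`, and `downgrade_in_place` are hypothetical names chosen for illustration, and only the two fields named in this issue are modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BrokerState:
    # Hypothetical stand-in for one entry of NewIsrWithEpochs.
    broker_id: int
    broker_epoch: int


@dataclass
class PartitionData:
    # Hypothetical stand-in for one partition entry in the request payload.
    new_isr: List[int] = field(default_factory=list)  # "NewIsr", versions < 3
    new_isr_with_epochs: List[BrokerState] = field(default_factory=list)  # "NewIsrWithEpochs", version 3


def downgrade_in_place(partition: PartitionData) -> None:
    """Convert a version-3 style entry to the older layout, mutating it."""
    # Extract the broker ids from NewIsrWithEpochs and fill in NewIsr ...
    partition.new_isr = [s.broker_id for s in partition.new_isr_with_epochs]
    # ... then clear the field that older versions do not carry.
    partition.new_isr_with_epochs = []


partition = PartitionData(
    new_isr_with_epochs=[BrokerState(1, 100), BrokerState(2, 101), BrokerState(3, 102)]
)
downgrade_in_place(partition)
print(partition.new_isr)              # [1, 2, 3]
print(partition.new_isr_with_epochs)  # []
```

      Note that the conversion modifies the partition entry itself, so the original `NewIsrWithEpochs` contents are gone once it has run.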

       

      The problem is that when the AlterPartitionRequest is sent for the first time and hits a transient error (e.g., NOT_CONTROLLER), we retry. On the retry, we build the AlterPartitionRequest again, but this time from request data that was already converted above. At that point, extracting the ISR from `NewIsrWithEpochs` yields an empty list, so we send an AlterPartition request with an empty ISR, which impacts Kafka availability.
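
      A minimal reproduction of this failure mode, again as a simplified Python model rather than the real builder: `build` and `build_on_copy` are hypothetical functions, and the dict layout is an assumption made for illustration. The second function shows one possible way to make repeated builds safe (converting a copy); it is not necessarily the fix that was merged.

```python
import copy


def build(data: dict, version: int) -> dict:
    """Hypothetical builder that downgrades `data` in place for versions < 3."""
    if version < 3:
        for p in data["partitions"]:
            p["new_isr"] = [s["broker_id"] for s in p["new_isr_with_epochs"]]
            p["new_isr_with_epochs"] = []
    return data


def build_on_copy(data: dict, version: int) -> dict:
    """Assumed alternative: convert a copy so the caller's data is untouched."""
    return build(copy.deepcopy(data), version)


def fresh_data() -> dict:
    return {
        "partitions": [
            {
                "new_isr": [],
                "new_isr_with_epochs": [
                    {"broker_id": 1, "broker_epoch": 10},
                    {"broker_id": 2, "broker_epoch": 11},
                ],
            }
        ]
    }


# First send, then a retry (e.g. after NOT_CONTROLLER) that reuses the same data.
data = fresh_data()
first_isr = list(build(data, version=2)["partitions"][0]["new_isr"])
retry_isr = list(build(data, version=2)["partitions"][0]["new_isr"])
print(first_isr)  # [1, 2]
print(retry_isr)  # [] because NewIsrWithEpochs was already cleared

# The copy-based variant returns the same ISR on every build.
data = fresh_data()
safe_first = build_on_copy(data, version=2)["partitions"][0]["new_isr"]
safe_retry = build_on_copy(data, version=2)["partitions"][0]["new_isr"]
print(safe_first, safe_retry)  # [1, 2] [1, 2]
```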

       

      From the log, I can see this:

      [2023-08-16 03:57:55,122] INFO [Partition test_topic-1 broker=3] ISR updated to  (under-min-isr) and version updated to 9 (kafka.cluster.Partition)
      ...
      [2023-08-16 03:57:55,157] ERROR [ReplicaManager broker=3] Error processing append operation on partition test_topic-1 (kafka.server.ReplicaManager)
      org.apache.kafka.common.errors.NotEnoughReplicasException: The size of the current ISR Set() is insufficient to satisfy the min.isr requirement of 2 for partition test_topic-1

       

      Impact:

      This happens when users upgrade from a version earlier than 3.5.0 to 3.5.0 or later. During the rolling upgrade, some nodes run v3.5.0 and others do not, so a node on v3.5.0 will build an older version of the AlterPartitionRequest. If that node then hits a transient error while sending the AlterPartitionRequest, the ISR will be empty and no producers will be able to write data to the affected partitions.
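
      To illustrate why an empty ISR blocks writes, here is a rough Python sketch of the kind of check behind the NotEnoughReplicasException in the log above. It assumes a producer using acks=all and a min.insync.replicas of 2 (matching the log); `check_append` and `NotEnoughReplicasError` are hypothetical names, not Kafka APIs.

```python
class NotEnoughReplicasError(Exception):
    """Hypothetical stand-in for Kafka's NotEnoughReplicasException."""


def check_append(isr: set, min_insync_replicas: int, acks: int) -> None:
    # With acks=all (-1), an append is rejected when the ISR is smaller
    # than min.insync.replicas.
    if acks == -1 and len(isr) < min_insync_replicas:
        raise NotEnoughReplicasError(
            f"ISR size {len(isr)} is below the min.isr requirement of {min_insync_replicas}"
        )


def is_rejected(isr: set, min_insync_replicas: int = 2, acks: int = -1) -> bool:
    try:
        check_append(isr, min_insync_replicas, acks)
        return False
    except NotEnoughReplicasError:
        return True


healthy_rejected = is_rejected({1, 2, 3})  # ISR before the bad request
empty_rejected = is_rejected(set())        # ISR after the empty AlterPartition
print(healthy_rejected, empty_rejected)    # False True
```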

            People

              Assignee: Calvin Liu (calvinliu)
              Reporter: Luke Chen (showuon)