Details
- Type: Test
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Test failed at https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1892/tests/ as well as in various local runs.
The test proceeds as follows (the AlterPartition step is sketched below):
1. Create a topic.
2. Fence a broker.
3. Note the partition imbalance that results from another broker taking over leadership of the partition the fenced broker lost.
4. Re-register and unfence the fenced broker.
5. Send AlterPartition for the lost partition, adding the now-unfenced broker back to its ISR.
6. Wait for the partition imbalance to disappear.
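For reference on step 5: AlterPartition is the request a partition leader sends to the controller to change a partition's ISR. A minimal sketch of assembling such a payload with Kafka's generated AlterPartitionRequestData follows; all broker IDs, epochs, and the topic ID are illustrative placeholders, and the exact setters vary with the Kafka version and request schema version.

    import java.util.Arrays;
    import java.util.Collections;

    import org.apache.kafka.common.Uuid;
    import org.apache.kafka.common.message.AlterPartitionRequestData;

    // Sketch: an AlterPartition payload proposing to add a re-unfenced
    // broker (id 2 here) back into the ISR of partition 0 of the topic.
    // All ids and epochs below are illustrative placeholders.
    public final class AlterPartitionSketch {
        public static AlterPartitionRequestData expandIsr(Uuid topicId) {
            AlterPartitionRequestData.PartitionData partition =
                new AlterPartitionRequestData.PartitionData()
                    .setPartitionIndex(0)
                    .setLeaderEpoch(3)                // the leader's current epoch
                    .setPartitionEpoch(3)             // last partition epoch the leader saw
                    .setNewIsr(Arrays.asList(1, 2));  // proposed ISR, now including broker 2

            AlterPartitionRequestData.TopicData topic =
                new AlterPartitionRequestData.TopicData()
                    .setTopicId(topicId)
                    .setPartitions(Collections.singletonList(partition));

            return new AlterPartitionRequestData()
                .setBrokerId(1)        // the partition leader issues the request
                .setBrokerEpoch(100L)  // and identifies itself with its broker epoch
                .setTopics(Collections.singletonList(topic));
        }

        private AlterPartitionSketch() { }
    }

The controller accepts the new ISR only if the leader epoch, partition epoch, and broker epoch it carries are current, which is why the test must unfence the broker before this step can succeed.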
The local failures seem to happen when brokers, including ones the test never fences deliberately, get fenced by accident: they lose their controller session by hitting the session timeout, which the test sets aggressively low.
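For context, in KRaft the controller fences any broker that goes quiet for longer than broker.session.timeout.ms, and brokers heartbeat every broker.heartbeat.interval.ms. A hedged sketch of a test-style configuration with an aggressively low timeout; the values are made up, not the ones this test uses:

    import java.util.Properties;

    // Illustrative only: KRaft settings with an aggressively low broker
    // session timeout, similar in spirit to what the test does.
    public final class LowSessionTimeoutProps {
        public static Properties controllerProps() {
            Properties props = new Properties();
            // Brokers heartbeat to the controller at this interval...
            props.setProperty("broker.heartbeat.interval.ms", "100");
            // ...and get fenced if the controller hears nothing for this long.
            props.setProperty("broker.session.timeout.ms", "300");
            return props;
        }

        private LowSessionTimeoutProps() { }
    }

With margins this tight, any scheduling hiccup on a loaded machine can delay a heartbeat past the timeout and fence a broker the test expected to stay alive.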
The CloudBees failure linked above also seems to show this happening:
...[truncated 738209 chars]... 23. (org.apache.kafka.controller.QuorumController:768)
[2023-06-02 18:17:22,202] DEBUG [QuorumController id=0] Scheduling write event for maybeBalancePartitionLeaders because scheduled (DEFERRED), checkIntervalNs (OptionalLong[1000000000]) and isImbalanced (true) (org.apache.kafka.controller.QuorumController:1401)
[2023-06-02 18:17:22,202] INFO [QuorumController id=0] Fencing broker 2 because its session has timed out. (org.apache.kafka.controller.ReplicationControlManager:1459)
[2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] handleBrokerFenced: changing partition(s): foo-0, foo-1, foo-2 (org.apache.kafka.controller.ReplicationControlManager:1750)
[2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] partition change for foo-0 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2 -> 3, partitionEpoch: 2 -> 3 (org.apache.kafka.controller.ReplicationControlManager:157)
[2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: isr: [2, 3] -> [3], leaderEpoch: 3 -> 4, partitionEpoch: 4 -> 5 (org.apache.kafka.controller.ReplicationControlManager:157)
[2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for foo-2 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2 -> 3, partitionEpoch: 2 -> 3 (org.apache.kafka.controller.ReplicationControlManager:157)
[2023-06-02 18:17:22,205] DEBUG append(batch=LocalRecordBatch(leaderEpoch=1, appendTimestamp=240, records=[ApiMessageAndVersion(PartitionChangeRecord(partitionId=0, topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at version 0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=1, topicId=033_QSX7TfitL4SDzoeR4w, isr=[3], leader=3, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at version 0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=2, topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at version 0), ApiMessageAndVersion(BrokerRegistrationChangeRecord(brokerId=2, brokerEpoch=3, fenced=1, inControlledShutdown=0) at version 0)]), prevOffset=27) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:253)
[2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Creating in-memory snapshot 27 (org.apache.kafka.timeline.SnapshotRegistry:197)
[2023-06-02 18:17:22,205] DEBUG [LocalLogManager 0] Node 0: running log check. (org.apache.kafka.metalog.LocalLogManager:512)
[2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Read-write operation maybeFenceReplicas(451616131) will be completed when the log reaches offset 27. (org.apache.kafka.controller.QuorumController:768)
[2023-06-02 18:17:22,208] INFO [QuorumController id=0] Fencing broker 3 because its session has timed out. (org.apache.kafka.controller.ReplicationControlManager:1459)
[2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] handleBrokerFenced: changing partition(s): foo-1 (org.apache.kafka.controller.ReplicationControlManager:1750)
[2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] partition change for foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 3 -> -1, leaderEpoch: 4 -> 5, partitionEpoch: 5 -> 6 (org.apache.kafka.controller.ReplicationControlManager:157)
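The two "Fencing broker ... because its session has timed out" lines show brokers 2 and 3, which the test never meant to fence, losing their sessions back to back. Conceptually, the controller's decision is a check of each broker's last heartbeat against the session timeout; the following is a hypothetical sketch of that check, not Kafka's actual internals:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch of the kind of check behind "Fencing broker N
    // because its session has timed out." None of these names are Kafka's
    // actual internals.
    public final class SessionCheckSketch {
        private final Map<Integer, Long> lastHeartbeatNs = new HashMap<>();
        private final long sessionTimeoutNs;

        public SessionCheckSketch(long sessionTimeoutMs) {
            this.sessionTimeoutNs = TimeUnit.MILLISECONDS.toNanos(sessionTimeoutMs);
        }

        public void recordHeartbeat(int brokerId, long nowNs) {
            lastHeartbeatNs.put(brokerId, nowNs);
        }

        /** True if the broker's last heartbeat is older than the session timeout. */
        public boolean sessionTimedOut(int brokerId, long nowNs) {
            Long last = lastHeartbeatNs.get(brokerId);
            return last != null && nowNs - last > sessionTimeoutNs;
        }
    }

Under a timeout this tight, the check trips as soon as one heartbeat is late, which matches the cascade of fencings and partition changes in the excerpt.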