
KAFKA-7241: Reassignment of partitions to a non-existent broker


Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1.1
    • Fix Version/s: None
    • Component/s: admin
    • Labels: None

    Description

      There is a problem when a partition reassignment is started towards a non-existent broker.

      The Kafka cluster has 3 brokers with ids 1, 2, 3.

      We try to reassign some partitions to another broker (e.g. with id=4) and end up in a situation where the reassignment task never stops. We cannot start any other reassignment tasks until that task finishes.
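
      The exact invocation is not recorded in this report, but the reassignment was presumably triggered with the standard kafka-reassign-partitions.sh tool against the ZooKeeper ensemble shown below, roughly like this (illustrative sketch only):

      bin/kafka-reassign-partitions.sh \
        --zookeeper grid1219:3185 \
        --reassignment-json-file reassignment_json.txt \
        --execute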

      Details:

      The broker list before the reassignment task is started:

      [zk: grid1219:3185(CONNECTED) 0] ls /brokers/ids
      [1, 2, 3]
      

      The admin path is:

      [zk: grid1219:3185(CONNECTED) 1] ls /admin
      [delete_topics]
      [zk: grid1219:3185(CONNECTED) 2] get /admin/delete_topics
      null
      cZxid = 0xe
      ctime = Fri Aug 03 08:04:25 MSK 2018
      mZxid = 0xe
      mtime = Fri Aug 03 08:04:25 MSK 2018
      pZxid = 0xe
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x0
      dataLength = 0
      numChildren = 0
      

      There is one topic, created with 20 partitions and replication factor 3. The reassignment JSON is attached as reassignment_json.txt. We write this JSON to the path /admin/reassign_partitions, and after that the partition reassignment starts.
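
      The attached file is not reproduced here, but judging from the controller log below (e.g. new replicas 4,1,2 for test-15 and 2,1,4 for test-2), the data written to /admin/reassign_partitions presumably has the standard reassignment form, roughly:

      {"version":1,
       "partitions":[
         {"topic":"test","partition":15,"replicas":[4,1,2]},
         {"topic":"test","partition":2,"replicas":[2,1,4]},
         ...
       ]}
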
      The result of the reassignment process can be seen in the Kafka controller logs (the full log is attached as kafka-logs.zip):

      [2018-08-03 08:52:21,329] INFO [Controller id=1] Handling reassignment of partition test-15 to new replicas 4,1,2 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,329] INFO [Controller id=1] New replicas 4,1,2 for partition test-15 being reassigned not yet caught up with the leader (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,330] INFO [Controller id=1] Updated assigned replicas for partition test-15 being reassigned to 4,1,2,3 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,330] DEBUG [Controller id=1] Updating leader epoch for partition test-15 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,331] INFO [Controller id=1] Updated leader epoch for partition test-15 to 1 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,331] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-15=PartitionState(controllerEpoch=1, leader=3, leaderEpoch=1, isr=3,1
      [2018-08-03 08:52:21,331] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-15=PartitionState(controllerEpoch=1, leader=3, leaderEpoch=0, isr=3,1
      [2018-08-03 08:52:21,331] INFO [Controller id=1] Waiting for new replicas 4,1,2 for partition test-15 being reassigned to catch up with the leader (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,332] INFO [Controller id=1] Handling reassignment of partition test-2 to new replicas 2,1,4 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,332] INFO [Controller id=1] New replicas 2,1,4 for partition test-2 being reassigned not yet caught up with the leader (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,333] INFO [Controller id=1] Updated assigned replicas for partition test-2 being reassigned to 2,1,4,3 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,333] DEBUG [Controller id=1] Updating leader epoch for partition test-2 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,333] INFO [Controller id=1] Updated leader epoch for partition test-2 to 1 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,334] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-2=PartitionState(controllerEpoch=1, leader=2, leaderEpoch=1, isr=2,1,
      [2018-08-03 08:52:21,334] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-2=PartitionState(controllerEpoch=1, leader=2, leaderEpoch=0, isr=2,1,
      [2018-08-03 08:52:21,334] INFO [Controller id=1] Waiting for new replicas 2,1,4 for partition test-2 being reassigned to catch up with the leader (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,337] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-14 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,337] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-6 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,338] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-17 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,338] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-11 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,339] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-10 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,339] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-19 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,340] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-0 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,340] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-7 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,341] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-18 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,341] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-5 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,342] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-8 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,342] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-1 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,343] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-13 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,343] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-4 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,344] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-16 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,344] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-9 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,345] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-3 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,345] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-12 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,346] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-15 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,346] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-2 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      

      After the reassignment process has finished its work, the znode /admin/reassign_partitions contains the value captured in the attached file reassignment_partitions_path_after_finish_work.txt.

      The stuck reassignment tasks remain in the znode /admin/reassign_partitions, and we cannot start any other reassignment tasks until the previous task has finished.
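
      As far as we know, the only way out today is manual intervention: once it is clear the reassignment can never complete, the znode has to be removed by hand from the ZooKeeper CLI (a workaround we assume here, not an officially supported mechanism):

      [zk: grid1219:3185(CONNECTED) 3] delete /admin/reassign_partitions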

      We need a proper mechanism to detect such situations and stop these reassignment tasks.
      For example, a timeout parameter could be introduced so that a per-partition reassignment task that cannot complete within the timeout is dropped from the znode, with a warning in the logs.
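
      Purely for illustration, such a setting could be a controller-side configuration; the name and default below are hypothetical and do not exist in Kafka today:

      # hypothetical controller setting, not an existing Kafka config
      reassignment.timeout.ms=600000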

      Attachments

        1. kafka-logs.zip
          114 kB
          Igor Martemyanov
        2. reassignment_json.txt
          2 kB
          Igor Martemyanov
        3. reassignment_partitions_path_after_finish_work.txt
          2 kB
          Igor Martemyanov


          People

            Assignee: Unassigned
            Reporter: Igor Martemyanov
            Votes: 0
            Watchers: 2
