
KAFKA-7241: Reassignment of partitions to a non-existent broker


Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1.1
    • Fix Version/s: None
    • Component/s: admin
    • Labels: None

    Description

      There is a problem when a partition reassignment is started towards a non-existent broker.

      The Kafka cluster has 3 brokers with ids 1, 2, 3.

      We try to reassign some partitions to another broker (e.g. with id=4) and end up in a situation where the reassignment task never stops. We cannot start any other reassignment tasks until that task finishes.
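
      The exact invocation is not recorded in this report, but the reassignment was presumably triggered with the standard kafka-reassign-partitions.sh tool against the ZooKeeper ensemble shown below, roughly like this (illustrative sketch only):

      bin/kafka-reassign-partitions.sh \
        --zookeeper grid1219:3185 \
        --reassignment-json-file reassignment_json.txt \
        --execute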

      Details:

      The broker list before the reassignment task is started:

      [zk: grid1219:3185(CONNECTED) 0] ls /brokers/ids
      [1, 2, 3]
      

      The admin path is:

      [zk: grid1219:3185(CONNECTED) 1] ls /admin
      [delete_topics]
      [zk: grid1219:3185(CONNECTED) 2] get /admin/delete_topics
      null
      cZxid = 0xe
      ctime = Fri Aug 03 08:04:25 MSK 2018
      mZxid = 0xe
      mtime = Fri Aug 03 08:04:25 MSK 2018
      pZxid = 0xe
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x0
      dataLength = 0
      numChildren = 0
      

      There is one topic, created with 20 partitions and replication factor 3. The reassignment JSON is attached as reassignment_json.txt. We write this JSON to the path /admin/reassign_partitions, and after that the partition reassignment starts.
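
      The attached file is not reproduced here, but judging from the controller log below (e.g. new replicas 4,1,2 for test-15 and 2,1,4 for test-2), the data written to /admin/reassign_partitions presumably has the standard reassignment form, roughly:

      {"version":1,
       "partitions":[
         {"topic":"test","partition":15,"replicas":[4,1,2]},
         {"topic":"test","partition":2,"replicas":[2,1,4]},
         ...
       ]}
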
      The result of the reassignment process can be seen in the Kafka controller logs (the full log is attached as kafka-logs.zip):

      [2018-08-03 08:52:21,329] INFO [Controller id=1] Handling reassignment of partition test-15 to new replicas 4,1,2 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,329] INFO [Controller id=1] New replicas 4,1,2 for partition test-15 being reassigned not yet caught up with the leader (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,330] INFO [Controller id=1] Updated assigned replicas for partition test-15 being reassigned to 4,1,2,3 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,330] DEBUG [Controller id=1] Updating leader epoch for partition test-15 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,331] INFO [Controller id=1] Updated leader epoch for partition test-15 to 1 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,331] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-15=PartitionState(controllerEpoch=1, leader=3, leaderEpoch=1, isr=3,1
      [2018-08-03 08:52:21,331] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-15=PartitionState(controllerEpoch=1, leader=3, leaderEpoch=0, isr=3,1
      [2018-08-03 08:52:21,331] INFO [Controller id=1] Waiting for new replicas 4,1,2 for partition test-15 being reassigned to catch up with the leader (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,332] INFO [Controller id=1] Handling reassignment of partition test-2 to new replicas 2,1,4 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,332] INFO [Controller id=1] New replicas 2,1,4 for partition test-2 being reassigned not yet caught up with the leader (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,333] INFO [Controller id=1] Updated assigned replicas for partition test-2 being reassigned to 2,1,4,3 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,333] DEBUG [Controller id=1] Updating leader epoch for partition test-2 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,333] INFO [Controller id=1] Updated leader epoch for partition test-2 to 1 (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,334] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-2=PartitionState(controllerEpoch=1, leader=2, leaderEpoch=1, isr=2,1,
      [2018-08-03 08:52:21,334] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-2=PartitionState(controllerEpoch=1, leader=2, leaderEpoch=0, isr=2,1,
      [2018-08-03 08:52:21,334] INFO [Controller id=1] Waiting for new replicas 2,1,4 for partition test-2 being reassigned to catch up with the leader (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,337] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-14 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,337] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-6 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,338] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-17 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,338] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-11 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,339] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-10 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,339] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-19 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,340] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-0 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,340] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-7 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,341] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-18 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,341] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-5 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,342] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-8 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,342] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-1 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,343] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-13 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,343] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-4 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,344] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-16 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,344] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-9 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,345] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-3 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,345] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-12 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,346] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-15 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      [2018-08-03 08:52:21,346] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-2 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
      

      After the reassignment process has finished its work, the znode /admin/reassign_partitions contains the value captured in the attached file reassignment_partitions_path_after_finish_work.txt.

      The stuck reassignment tasks remain in the znode /admin/reassign_partitions, and we cannot start any other reassignment tasks until the previous task has finished.
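
      As far as we know, the only way out today is manual intervention: once it is clear the reassignment can never complete, the znode has to be removed by hand from the ZooKeeper CLI (a workaround we assume here, not an officially supported mechanism):

      [zk: grid1219:3185(CONNECTED) 3] delete /admin/reassign_partitions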

      We need a proper mechanism to detect such situations and stop these reassignment tasks.
      For example, a timeout parameter could be introduced so that a per-partition reassignment task that cannot complete within the timeout is dropped from the znode, with a warning in the logs.
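
      Purely for illustration, such a setting could be a controller-side configuration; the name and default below are hypothetical and do not exist in Kafka today:

      # hypothetical controller setting, not an existing Kafka config
      reassignment.timeout.ms=600000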

      Attachments

        1. kafka-logs.zip
          114 kB
          Igor Martemyanov
        2. reassignment_json.txt
          2 kB
          Igor Martemyanov
        3. reassignment_partitions_path_after_finish_work.txt
          2 kB
          Igor Martemyanov


          People

            Assignee: Unassigned
            Reporter: Igor Martemyanov
            Votes: 0
            Watchers: 2
