  Kafka / KAFKA-5758

Reassigning a topic's partitions can adversely impact other topics


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.1.1
    • Fix Version/s: 1.0.0
    • Component/s: core
    • Labels: None

      Description

      We've noticed that reassigning a topic's partitions seems to adversely impact other topics. Specifically, followers for other topics fall out of the ISR.

      While I'm not 100% sure why this happens, the scenario seems to be as follows:

      1. Reassignment is manually triggered on topic-partition X-Y, and broker A (which used to be a follower for X-Y) is no longer an assigned replica.
      2. Just after the reassignment, broker A sends a `FetchRequest` that still includes topic-partition X-Y to broker B, the leader for X-Y (sketched just after this list).
      3. Broker B could fulfill the `FetchRequest`, but while doing so it tries to record "follower" A's fetch position for X-Y. This fails, because broker A is no longer an assigned replica for X-Y (see the exception below).
      4. The entire `FetchRequest` fails, so the other topics that broker A follows via the same fetcher start falling behind.
      5. This sequence repeats for as long as the reassignment is in progress.
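      To make step 2 concrete, here is a rough sketch of the follower side (broker A). This is not the actual `ReplicaFetcherThread` code and the names are purely illustrative; the point is just that the fetch is built from whatever partitions the fetcher currently tracks, so until broker A processes the controller's state change removing it as a replica of X-Y, that partition keeps showing up in its `FetchRequest`:

      // Hypothetical, simplified model of the follower-side fetcher on broker A.
      case class TopicPartition(topic: String, partition: Int)

      class SimplifiedReplicaFetcher(initial: Set[TopicPartition]) {
        private var tracked: Set[TopicPartition] = initial

        // The next FetchRequest covers whatever is currently tracked. Just after
        // a reassignment, X-Y can still be in this set.
        def buildFetchPartitions(): Set[TopicPartition] = tracked

        // Only once the controller's state change reaches this broker does X-Y
        // drop out of the fetch.
        def removePartitions(removed: Set[TopicPartition]): Unit =
          tracked = tracked -- removed
      }
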

      In step 3, we see exceptions like:

      Error when handling request Name: FetchRequest; Version: 3; CorrelationId: 46781859; ClientId: ReplicaFetcherThread-0-1001; ReplicaId: 1006; MaxWait: 500 ms; MinBytes: 1 bytes; MaxBytes:10485760 bytes; RequestInfo: 
      
      <LOTS OF PARTITIONS>
      
      kafka.common.NotAssignedReplicaException: Leader 1001 failed to record follower 1006's position -1 since the replica is not recognized to be one of the assigned replicas 1001,1004,1005 for partition [topic_being_reassigned,5].
      at kafka.cluster.Partition.updateReplicaLogReadResult(Partition.scala:249)
      	at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:923)
      	at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:920)
      	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
      	at kafka.server.ReplicaManager.updateFollowerLogReadResults(ReplicaManager.scala:920)
      	at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:481)
      	at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:534)
      	at kafka.server.KafkaApis.handle(KafkaApis.scala:79)
      	at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
      	at java.lang.Thread.run(Thread.java:745)
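
      Reading the stack trace, the position update seems to happen in a loop over all partitions in the request, so a single not-assigned replica aborts the whole thing. A simplified, hypothetical sketch of that control flow (again, illustrative names, not the actual `ReplicaManager` code):

      // Hypothetical, simplified model of the leader side (broker B).
      case class TopicPartition(topic: String, partition: Int)
      class NotAssignedReplicaException(msg: String) extends RuntimeException(msg)

      class SimplifiedLeader(assignedReplicas: Map[TopicPartition, Set[Int]]) {

        private def updateReplicaLogReadResult(tp: TopicPartition, replicaId: Int): Unit = {
          val assigned = assignedReplicas.getOrElse(tp, Set.empty[Int])
          if (!assigned.contains(replicaId))
            throw new NotAssignedReplicaException(
              s"Replica $replicaId is not one of the assigned replicas $assigned for $tp")
          // ...otherwise record the follower's fetch position...
        }

        // The exception for the single reassigned partition propagates out of the
        // foreach, so the positions of the other, perfectly healthy partitions in
        // the same FetchRequest are never recorded and the request as a whole fails.
        def updateFollowerLogReadResults(replicaId: Int, partitions: Seq[TopicPartition]): Unit =
          partitions.foreach(tp => updateReplicaLogReadResult(tp, replicaId))
      }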
      

      Does my assessment make sense? If so, this behaviour seems problematic. A few changes that might improve matters (assuming I'm on the right track):

      1. `FetchRequest` should be able to return partial results, so that a failure on one partition does not fail the whole request.
      2. The broker fulfilling the `FetchRequest` could catch the `NotAssignedReplicaException` and return results without recording the no-longer-a-follower's position (a rough sketch of this follows the list).
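
      Reusing the hypothetical names from the leader-side sketch above, change 2 would look roughly like this (a sketch of the idea, not a patch against the real code):

      // Hypothetical sketch of change 2: tolerate one stale follower/partition
      // pair instead of failing the entire FetchRequest.
      def updateFollowerLogReadResults(replicaId: Int, partitions: Seq[TopicPartition]): Unit =
        partitions.foreach { tp =>
          try {
            updateReplicaLogReadResult(tp, replicaId)
          } catch {
            case _: NotAssignedReplicaException =>
              // Broker `replicaId` is no longer an assigned replica for `tp` (e.g. a
              // reassignment just removed it). Skip recording its position, but keep
              // processing the remaining partitions in the request.
          }
        }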

      This behaviour was experienced with 0.10.1.1, although looking at the changelogs and the code in question, I don't see any reason why it would have changed in later versions.

      I'm very interested in having some discussion on this. Thanks!


    People

    • Assignee: Ismael Juma (ijuma)
    • Reporter: David van Geest (dwvangeest)
    • Votes: 0
    • Watchers: 9
