  1. Kafka
  2. KAFKA-1987

Potential race condition in partition creation

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: controller
    • Labels: None

    Description

      There appears to be a race condition between the leader and the follower when creating partitions with a replication factor of 2 or higher. The follower seems to process the command to create the partition before the leader does, so when the follower starts its replica fetcher, the fetch fails with an UnknownTopicOrPartitionException.

      I hit this while creating a large number of partitions on a cluster to prepare it for data mirrored from another cluster, so a sizeable number of create and alter commands are being sent sequentially. The replica fetchers do eventually start up properly. Still, it seems the controller should issue the create-partition command to the leader, wait for confirmation, and only then issue it to the followers.
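      As a rough illustration of the ordering suggested above, here is a minimal Python sketch. It is not Kafka's controller code: the `Broker` class, `create_partition`, and `create_in_order` are hypothetical names invented for this example, and the "acknowledgement" is just a return value from a synchronous call.

```python
class Broker:
    """Hypothetical stand-in for a broker handling a create-partition command."""

    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.partitions = set()

    def create_partition(self, partition, leader=None):
        # A follower starts fetching from the leader as soon as it creates the
        # partition, so the leader must already know about it. If it does not,
        # the fetch fails (modelled here as a LookupError).
        if leader is not None and partition not in leader.partitions:
            raise LookupError(
                f"UnknownTopicOrPartition: {partition} on broker {leader.broker_id}"
            )
        self.partitions.add(partition)
        return True  # stands in for the broker's acknowledgement


def create_in_order(partition, leader, followers):
    """Send the command to the leader, wait for its ack, then to the followers."""
    acked = leader.create_partition(partition)
    if not acked:
        raise RuntimeError("leader did not confirm partition creation")
    for follower in followers:
        follower.create_partition(partition, leader=leader)


leader = Broker(1555)
follower = Broker(1551)
create_in_order(("topicA", 30), leader, [follower])
```

      With this ordering the follower never asks the leader for a partition the leader has not yet created. If the follower is told first, the same call raises the lookup error, which corresponds to the failure in the log below.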

      2015/02/26 21:11:50.413 INFO [LogManager] [kafka-request-handler-12] [kafka-server] [] Created log for partition [topicA,30] in /path_to/i001_caches with properties {segment.index.bytes -> 10485760, file.delete.delay.ms -> 60000, segment.bytes -> 268435456, flush.ms -> 10000, delete.retention.ms -> 86400000, index.interval.bytes -> 4096, retention.bytes -> -1, min.insync.replicas -> 1, cleanup.policy -> delete, unclean.leader.election.enable -> true, segment.ms -> 43200000, max.message.bytes -> 1000000, flush.messages -> 20000, min.cleanable.dirty.ratio -> 0.5, retention.ms -> 86400000, segment.jitter.ms -> 0}.
      2015/02/26 21:11:50.418 WARN [Partition] [kafka-request-handler-12] [kafka-server] [] Partition [topicA,30] on broker 1551: No checkpointed highwatermark is found for partition [topicA,30]
      2015/02/26 21:11:50.418 INFO [ReplicaFetcherManager] [kafka-request-handler-12] [kafka-server] [] [ReplicaFetcherManager on broker 1551] Removed fetcher for partitions [topicA,30]
      2015/02/26 21:11:50.418 INFO [Log] [kafka-request-handler-12] [kafka-server] [] Truncating log topicA-30 to offset 0.
      2015/02/26 21:11:50.450 INFO [ReplicaFetcherManager] [kafka-request-handler-12] [kafka-server] [] [ReplicaFetcherManager on broker 1551] Added fetcher for partitions List([[topicA,30], initOffset 0 to broker id:1555,host:host1555.example.com,port:10251] )
      2015/02/26 21:11:50.615 ERROR [ReplicaFetcherThread] [ReplicaFetcherThread-0-1555] [kafka-server] [] [ReplicaFetcherThread-0-1555], Error for partition [topicA,30] to broker 1555:class kafka.common.UnknownTopicOrPartitionException
      2015/02/26 21:11:50.616 ERROR [ReplicaFetcherThread] [ReplicaFetcherThread-0-1555] [kafka-server] [] [ReplicaFetcherThread-0-1555], Error for partition [topicA,30] to broker 1555:class kafka.common.UnknownTopicOrPartitionException
      2015/02/26 21:11:50.618 ERROR [ReplicaFetcherThread] [ReplicaFetcherThread-0-1555] [kafka-server] [] [ReplicaFetcherThread-0-1555], Error for partition [topicA,30] to broker 1555:class kafka.common.UnknownTopicOrPartitionException
      2015/02/26 21:11:50.620 ERROR [ReplicaFetcherThread] [ReplicaFetcherThread-0-1555] [kafka-server] [] [ReplicaFetcherThread-0-1555], Error for partition [topicA,30] to broker 1555:class kafka.common.UnknownTopicOrPartitionException
      2015/02/26 21:11:50.621 ERROR [ReplicaFetcherThread] [ReplicaFetcherThread-0-1555] [kafka-server] [] [ReplicaFetcherThread-0-1555], Error for partition [topicA,30] to broker 1555:class kafka.common.UnknownTopicOrPartitionException
      2
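      To gauge how often this happens across many partitions, the error lines can be tallied per partition and leader broker. The following is a minimal Python sketch that assumes the log line layout shown above; `ERROR_RE` and `count_unknown_partition_errors` are names made up for this example, not part of any Kafka tooling.

```python
import re
from collections import Counter

# Matches the ReplicaFetcherThread error lines in the format shown above.
ERROR_RE = re.compile(
    r"ERROR \[ReplicaFetcherThread\].*"
    r"Error for partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\] "
    r"to broker (?P<broker>\d+):"
    r"class kafka\.common\.UnknownTopicOrPartitionException"
)


def count_unknown_partition_errors(lines):
    """Count UnknownTopicOrPartitionException errors per (topic, partition, broker)."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            key = (
                match.group("topic"),
                int(match.group("partition")),
                int(match.group("broker")),
            )
            counts[key] += 1
    return counts


sample = [
    "2015/02/26 21:11:50.418 INFO [Log] [kafka-request-handler-12] [kafka-server] [] Truncating log topicA-30 to offset 0.",
    "2015/02/26 21:11:50.615 ERROR [ReplicaFetcherThread] [ReplicaFetcherThread-0-1555] [kafka-server] [] [ReplicaFetcherThread-0-1555], Error for partition [topicA,30] to broker 1555:class kafka.common.UnknownTopicOrPartitionException",
    "2015/02/26 21:11:50.616 ERROR [ReplicaFetcherThread] [ReplicaFetcherThread-0-1555] [kafka-server] [] [ReplicaFetcherThread-0-1555], Error for partition [topicA,30] to broker 1555:class kafka.common.UnknownTopicOrPartitionException",
]
counts = count_unknown_partition_errors(sample)
```

      Partitions whose count stops growing shortly after creation would be consistent with the fetchers eventually starting up properly, as described above.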

          People

            Assignee: Unassigned
            Reporter: Todd Palino (toddpalino)
            Votes: 0
            Watchers: 4
