  1. Kafka
  2. KAFKA-16061 KRaft JBOD follow-ups and improvements
  3. KAFKA-16234

Log directory failure re-creates partitions in another logdir automatically

Details

    • Sub-task
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 3.7.0
    • 3.8.0, 3.7.1
    • jbod
    • None

    Description

      With KAFKA-16157, we changed the HostedPartition.Offline enum variant to embed the Partition object. In addition, ReplicaManager::getOrCreatePartition compares the old and new topicIds to decide whether it needs to create a new log.

      The getter for Partition::topicId retrieves the topicId from the log field or from logManager.currentLogs. The former is set to None when a partition is marked offline, and LogManager::handleLogDirFailure removes the partition's key from the latter. As a result, topicId always returns None for a partition marked offline, and new logs for all partitions in a failed log directory are always created on another disk.
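
      The lookup chain described above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyLog, ToyLogManager, and ToyPartition are hypothetical stand-ins, not Kafka's real classes, and they keep only the two lookup sources named in the description.

```scala
import java.util.UUID
import scala.collection.mutable

// Toy stand-ins for illustration only; none of these are Kafka's real classes.
final case class TopicPartition(topic: String, partition: Int)
final case class ToyLog(dir: String, topicId: Option[UUID])

final class ToyLogManager {
  val currentLogs: mutable.Map[TopicPartition, ToyLog] = mutable.Map.empty

  // Mimics the effect described in the issue: every log that lives in the
  // failed directory is removed from currentLogs.
  def handleLogDirFailure(dir: String): Unit = {
    val failed = currentLogs.collect { case (tp, log) if log.dir == dir => tp }.toList
    failed.foreach(tp => currentLogs.remove(tp))
  }
}

final class ToyPartition(val tp: TopicPartition, logManager: ToyLogManager) {
  var log: Option[ToyLog] = None

  // The topic ID is derived only from `log` or from the log manager's map.
  def topicId: Option[UUID] =
    log.flatMap(_.topicId)
      .orElse(logManager.currentLogs.get(tp).flatMap(_.topicId))

  // Marking the partition offline clears the `log` field.
  def markOffline(): Unit = {
    log = None
  }
}
```

      In this model, once handleLogDirFailure and markOffline have both run, neither source can supply the ID, so topicId yields None even though the partition had a topic ID before the failure.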

      The broker will fail to restart after the failed disk is repaired because the same partitions will exist in two different directories. The error does, however, tell the operator to remove the partitions from the disk that failed, which should help with broker startup.

      We can avoid this with KAFKA-16212, but in the short term, an immediate solution is to have the Partition object accept an Option[TopicId] in its constructor and fall back to log or logManager if it is unset.
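
      A minimal sketch of the proposed short-term fix, assuming a simplified shape: LogSketch, PartitionSketch, ReplicaManagerSketch, the initialTopicId parameter, and the lookupLog callback are all hypothetical names used for illustration, and the needsNewLog comparison is an assumed simplification of the decision made in ReplicaManager::getOrCreatePartition, not the actual patch.

```scala
import java.util.UUID

// Hypothetical stand-in for a log that may carry a topic ID.
final case class LogSketch(topicId: Option[UUID])

// `initialTopicId` is the hypothetical constructor argument from the proposal;
// `lookupLog` stands in for the logManager lookup.
final class PartitionSketch(initialTopicId: Option[UUID],
                            lookupLog: () => Option[LogSketch]) {
  var log: Option[LogSketch] = None

  // Prefer the ID supplied at construction; fall back to `log`, then to the
  // log manager lookup, only when it is unset.
  def topicId: Option[UUID] =
    initialTopicId
      .orElse(log.flatMap(_.topicId))
      .orElse(lookupLog().flatMap(_.topicId))
}

object ReplicaManagerSketch {
  // Assumed simplification: a new log is needed only when the existing
  // partition's topic ID does not match the requested one.
  def needsNewLog(existing: PartitionSketch, requested: UUID): Boolean =
    !existing.topicId.contains(requested)
}
```

      Under these assumptions, an offline partition constructed with Some(id) still reports its topic ID after its log is cleared and the log manager no longer tracks it, so the comparison no longer treats it as a partition that needs a new log.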

            People

              omnia_h_ibrahim Omnia Ibrahim
              gnarula Gaurav Narula