Kafka / KAFKA-9118

LogDirFailureHandler shouldn't use Zookeeper

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      As described in KIP-112:

      2. A log directory stops working on a broker during runtime
      
      - The controller watches the path /log_dir_event_notification for new znodes.
      - The broker detects offline log directories during runtime.
      - The broker takes actions as if it has received StopReplicaRequest for this replica. More specifically, the replica is no longer considered leader and is removed from any replica fetcher thread. (The clients will receive an UnknownTopicOrPartitionException at this point.)
      - The broker notifies the controller by creating a sequential znode under path /log_dir_event_notification with data of the format {"version" : 1, "broker" : brokerId, "event" : LogDirFailure}.
      - The controller reads the znode to get the brokerId and finds that the event type is LogDirFailure.
      - The controller deletes the notification znode
      - The controller sends LeaderAndIsrRequest to that broker to query the state of all topic partitions on the broker. The LeaderAndIsrResponse from this broker will specify KafkaStorageException for those partitions that are on the bad log directories.
      - The controller updates the information of offline replicas in memory and triggers leader election as appropriate.
      - The controller removes offline replicas from the ISR in ZK and sends LeaderAndIsrRequest with the updated ISR to be used by partition leaders.
      - The controller propagates the information of offline replicas to brokers by sending UpdateMetadataRequest.
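
      The znode-based notification steps above can be sketched roughly as follows. This is an illustrative model only, not Kafka's implementation: `FakeZk` is a hypothetical in-memory stand-in for ZooKeeper, the `log_dir_event_` child prefix and the numeric encoding `LogDirFailure = 1` are assumptions, and the function names are invented for the example. Only the path and the JSON field names come from the quoted text.

```python
import json

LOG_DIR_EVENT_PATH = "/log_dir_event_notification"
LOG_DIR_FAILURE = 1  # assumption: numeric encoding of the LogDirFailure event


def encode_event(broker_id):
    """Build the notification payload in the format quoted from KIP-112."""
    payload = {"version": 1, "broker": broker_id, "event": LOG_DIR_FAILURE}
    return json.dumps(payload).encode("utf-8")


def decode_event(data):
    """Return the broker id for a LogDirFailure event, or None otherwise."""
    payload = json.loads(data.decode("utf-8"))
    if payload.get("event") != LOG_DIR_FAILURE:
        return None
    return int(payload["broker"])


class FakeZk:
    """Hypothetical in-memory stand-in for ZooKeeper (not a real client)."""

    def __init__(self):
        self.znodes = {}
        self._seq = 0

    def create_sequential(self, parent, data):
        # Sequential znodes get a monotonically increasing suffix.
        path = f"{parent}/log_dir_event_{self._seq:010d}"
        self._seq += 1
        self.znodes[path] = data
        return path

    def children(self, parent):
        return sorted(p for p in self.znodes if p.startswith(parent + "/"))

    def delete(self, path):
        del self.znodes[path]


def controller_process_notifications(zk):
    """Read each notification znode, delete it, and collect the broker ids
    that the controller would then query with LeaderAndIsrRequest."""
    brokers = []
    for path in zk.children(LOG_DIR_EVENT_PATH):
        broker_id = decode_event(zk.znodes[path])
        zk.delete(path)
        if broker_id is not None:
            brokers.append(broker_id)
    return brokers
```

      For example, a broker with id 3 would call `zk.create_sequential(LOG_DIR_EVENT_PATH, encode_event(3))`, after which `controller_process_notifications(zk)` yields that broker id and leaves no notification znodes behind.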
      

      Instead of the notification znode, we should use a Kafka protocol request through which the broker notifies the controller directly and includes the offline partitions. The controller then updates the information of offline replicas in memory and triggers leader election, then removes the replicas from the ISR in ZK and sends a LeaderAndIsrRequest and an UpdateMetadataRequest.
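
      A minimal sketch of how the controller side of such a notification might behave, under stated assumptions: the message type `LogDirFailureNotification`, its fields, and the `Controller` class are hypothetical names, since the issue does not define a wire format. Leader election is reduced to "first assigned replica still in the ISR", which is a simplification and not Kafka's actual election logic. The returned list stands in for the partitions that would be covered by the follow-up LeaderAndIsrRequest and UpdateMetadataRequest.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TopicPartition = Tuple[str, int]


@dataclass(frozen=True)
class LogDirFailureNotification:
    """Hypothetical broker-to-controller message (not a real Kafka API)."""
    broker_id: int
    offline_partitions: Tuple[TopicPartition, ...]


@dataclass
class PartitionState:
    replicas: List[int]      # assigned replicas, in preference order
    isr: List[int]           # in-sync replicas
    leader: Optional[int]    # None means no leader is available


class Controller:
    """Toy controller that keeps partition state in memory only."""

    def __init__(self, partitions):
        self.partitions = partitions
        self.offline_replicas = set()

    def handle(self, notification):
        broker = notification.broker_id
        changed = []
        for tp in notification.offline_partitions:
            state = self.partitions.get(tp)
            if state is None:
                continue  # unknown partition: nothing to update
            # 1. Record the offline replica in memory.
            self.offline_replicas.add((tp, broker))
            # 2. Elect a new leader if the failed broker was leading.
            if state.leader == broker:
                candidates = (r for r in state.replicas
                              if r in state.isr and r != broker)
                state.leader = next(candidates, None)
            # 3. Remove the offline replica from the ISR.
            state.isr = [r for r in state.isr if r != broker]
            changed.append(tp)
        # Partitions to include in the follow-up requests.
        return changed
```

      Compared with the znode flow, the controller in this sketch learns the affected partitions from the notification itself, so it would not need a separate LeaderAndIsrRequest round trip just to discover which replicas are offline.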

              People

              • Assignee: Viktor Somogyi-Vass (viktorsomogyi)
              • Reporter: Viktor Somogyi-Vass (viktorsomogyi)
              • Votes: 1
              • Watchers: 4
