Description
KIP-858 proposed that when a directory failure occurs after changing the assignment of a replica that's moved between two directories in the same broker, but before the future replica promotion completes, the broker should reassign the replica to inform the controller of its correct status. But this hasn't yet been implemented, and without it this failure may lead to indefinite partition unavailability.
Example scenario:
- A broker which leads partition P receives a request to move its replica of P from directory A to directory B.
- The broker creates a future replica in directory B and starts a replica fetcher.
- Once the future replica first catches up, the broker queues a reassignment to inform the controller of the directory change.
- The next time the replica catches up, the broker would briefly block appends and promote the future replica. However, directory A fails before the promotion is attempted.
- The controller was informed that P is now in directory B before it received the notification that directory A has failed, so it does not elect a new leader, and as long as the broker is online, partition P remains unavailable.
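
The ordering problem in the scenario above can be sketched as a small simulation. This is a deliberately simplified model, not Kafka code: `ControllerView`, `onAssignReplicaToDir`, `onDirectoryFailure`, and `simulate` are hypothetical names invented for illustration. It also assumes that the corrective reassignment proposed in KIP-858 amounts to reporting directory A to the controller again.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the race; none of these types exist in Kafka.
class DirFailureRaceSketch {

    /** What the controller believes: partition -> directory hosting the replica. */
    static final class ControllerView {
        private final Map<String, String> assignedDir = new HashMap<>();

        /** The broker tells the controller which directory hosts the replica. */
        void onAssignReplicaToDir(String partition, String dir) {
            assignedDir.put(partition, dir);
        }

        /**
         * The broker reports a failed directory. Returns true if the controller
         * sees any partition in that directory and would elect a new leader.
         */
        boolean onDirectoryFailure(String failedDir) {
            return assignedDir.containsValue(failedDir);
        }
    }

    /** Replays the scenario; returns true if a new leader election is triggered. */
    static boolean simulate(boolean brokerCorrectsAssignment) {
        ControllerView controller = new ControllerView();
        controller.onAssignReplicaToDir("P", "A"); // initial placement

        // The future replica in B first catches up: the broker queues the reassignment.
        controller.onAssignReplicaToDir("P", "B");

        // Directory A fails before promotion, so the real log never left A.
        if (brokerCorrectsAssignment) {
            // Assumed shape of the KIP-858 correction: report the true location again.
            controller.onAssignReplicaToDir("P", "A");
        }
        return controller.onDirectoryFailure("A");
    }

    public static void main(String[] args) {
        System.out.println("election without correction: " + simulate(false));
        System.out.println("election with correction: " + simulate(true));
    }
}
```

In this model, without the correction the controller never associates P with the failed directory, which mirrors the indefinite unavailability described above. With the correction, the failure notification for directory A covers P and a new leader can be elected.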