Currently, if a disk fails at runtime, its data directory is marked as "failed" in memory, the tablets on it are marked failed and re-replicated elsewhere, and the server carries on until the Kudu admins want to fix it. At that point, they need to bring down the server, replace the bad disk, run the `kudu fs update_dirs` tool to adopt the new disk and forget the old data directory, and then start the server back up. This process can be slow, particularly since server startup can be slow, depending on the amount of data on the server.
As implemented today, no I/O should be going to a failed data directory anyway, so the next logical extension would be to allow users to replace such failed directories while the server is up: remove the old data directory from memory, add a new one with a new UUID, and update the path instance metadata files (PIMFs) on disk to reflect the change.
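The in-memory half of this swap could look roughly like the following sketch. This is illustrative only: Kudu is implemented in C++, and `DataDir` and `hot_swap` are hypothetical names, not Kudu's actual classes or functions.

```python
import uuid


class DataDir:
    """Hypothetical in-memory record of a data directory (illustrative only)."""
    def __init__(self, path, dir_uuid=None, failed=False):
        self.path = path
        self.uuid = dir_uuid or str(uuid.uuid4())
        self.failed = failed


def hot_swap(dirs, old_path, new_path):
    """Replace a failed data dir in memory: drop the old entry, add a new one
    with a fresh UUID, and return the updated UUID set that the path instance
    metadata files on every surviving disk would need to be rewritten with."""
    idx = next((i for i, d in enumerate(dirs) if d.path == old_path), None)
    if idx is None:
        raise ValueError("unknown data dir: %s" % old_path)
    if not dirs[idx].failed:
        # Matches the consideration below: refuse to swap a healthy dir.
        raise ValueError("%s is not failed; refusing to swap" % old_path)
    dirs[idx] = DataDir(new_path)
    return {d.uuid for d in dirs}
```

Note that the swap deliberately keeps the directory count stable, so an unchanged `fs_data_dirs` list (with one path substituted) still matches what is on disk.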
A few considerations should be taken into account in implementing this:
- Should we support removal of directories (as opposed to replacement)? My feeling is no, since removal would break subsequent uses of the same `fs_data_dirs` configuration.
- Writing the PIMFs may be challenging. These files would need to be rewritten to adopt a new, empty disk. Writing to multiple files across multiple disks may be messy, and doing so while the server is online only exacerbates the problem. A reasonable amount of error handling should be scoped out.
- We should be careful in picking when a hot-swap is actually viable. E.g., if a hot-swap is requested and the data directory isn't actually failed, we shouldn't do anything. Alternatively, a user may want to request a forced "failing" of a data directory in preparation for a swap.
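To make the metadata-rewrite concern above concrete, here is one possible error-handling policy, sketched in Python: rewrite each healthy directory's file best-effort, and mark a directory failed (rather than aborting the whole swap) if its own rewrite fails. All names and the JSON on-disk format are hypothetical, not Kudu's actual implementation.

```python
import json
import os


class DataDir:
    """Hypothetical in-memory record of a data directory (illustrative only)."""
    def __init__(self, path, dir_uuid, failed=False):
        self.path = path
        self.uuid = dir_uuid
        self.failed = failed


def write_instance_file(dir_path, all_uuids):
    # Hypothetical on-disk format: a JSON list of every data dir's UUID,
    # written via temp-file-plus-rename so a crash can't leave a torn file.
    tmp = os.path.join(dir_path, "instance.tmp")
    final = os.path.join(dir_path, "instance")
    with open(tmp, "w") as f:
        json.dump(sorted(all_uuids), f)
    os.replace(tmp, final)


def rewrite_instance_files(dirs):
    """Best-effort rewrite of each healthy dir's instance metadata. A dir
    whose rewrite fails is itself marked failed instead of aborting the
    swap; this is one plausible policy among several."""
    all_uuids = {d.uuid for d in dirs}
    errors = {}
    for d in dirs:
        if d.failed:
            continue
        try:
            write_instance_file(d.path, all_uuids)
        except OSError as e:
            d.failed = True
            errors[d.path] = e
    return errors
```

A policy like this keeps a single bad disk from blocking the swap of another, at the cost of cascading an additional directory failure when a rewrite hits an I/O error.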