Description
We've seen an issue wherein a tablet server crashed because of disk space issues. The tablet server as a whole still had free space, but several of its disks were full.
W0726 10:50:58.608566 41367 tablet_replica_mm_ops.cc:144] T d29679efebf94ccb9ed8de7daa44f3ef P 649f3f936e204410a62156f322ac6f90: failed to flush MRS: IO error: Failed to open DiskRowSet for flush: Unable to open output file for column cluster_id[string NOT NULL]: No directories available to add to d29679efebf94ccb9ed8de7daa44f3ef's directory group (11 dirs total, 4 full, 0 failed). (error 28)
F0726 10:50:58.608582 41367 tablet_replica_mm_ops.cc:145] Check failed: tablet->HasBeenStopped() FlushMRS failure is only allowed if the tablet is stopped first
Note that the error message is a red herring: the failure really came from selecting a directory in which to place a container, not from selecting a directory to add to the directory group.
There were 4 full disks; presumably the tablet had a default directory group size of 3, and all of its directories were full.
It would be nice for directory groups to be dynamically resized as needed. If getting a directory for block placement yields an ENOSPC, we should consider adding a directory to the directory group based on available space or based on the number of replicas in the remaining directories.
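A rough sketch of the proposed fallback, for discussion only. None of the names below (Dir, DirGroup, PickFromGroup, GrowGroup, GetDirForBlock) are Kudu's actual classes or functions; they are a simplified, hypothetical model. The sketch assumes one possible policy: when every directory in the group is full, add the non-full directory outside the group that hosts the fewest replicas, breaking ties by free space.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a data directory (not Kudu's real type).
struct Dir {
  std::string path;
  int64_t free_bytes;  // space currently available in this directory
  int num_replicas;    // replicas whose directory groups include this directory
  bool full() const { return free_bytes <= 0; }
};

// Hypothetical directory group: indices into the server-wide directory list.
struct DirGroup {
  std::vector<size_t> members;
};

// Returns the index of the non-full group member with the most free space,
// or -1 if every member is full (the ENOSPC case described above).
int PickFromGroup(const std::vector<Dir>& dirs, const DirGroup& group) {
  int best = -1;
  for (size_t idx : group.members) {
    if (dirs[idx].full()) continue;
    if (best < 0 || dirs[idx].free_bytes > dirs[best].free_bytes) {
      best = static_cast<int>(idx);
    }
  }
  return best;
}

// Adds one directory to the group: the non-full directory outside the group
// with the fewest replicas, ties broken by free space. Returns its index,
// or -1 if no such directory exists (the group is left unchanged).
int GrowGroup(const std::vector<Dir>& dirs, DirGroup* group) {
  assert(group != nullptr);
  int best = -1;
  for (size_t i = 0; i < dirs.size(); ++i) {
    const bool in_group =
        std::find(group->members.begin(), group->members.end(), i) !=
        group->members.end();
    if (in_group || dirs[i].full()) continue;
    const bool better =
        best < 0 ||
        dirs[i].num_replicas < dirs[best].num_replicas ||
        (dirs[i].num_replicas == dirs[best].num_replicas &&
         dirs[i].free_bytes > dirs[best].free_bytes);
    if (better) best = static_cast<int>(i);
  }
  if (best >= 0) group->members.push_back(static_cast<size_t>(best));
  return best;
}

// Block placement with the proposed fallback: try the existing group first,
// and only grow the group when all of its directories are full.
int GetDirForBlock(const std::vector<Dir>& dirs, DirGroup* group) {
  const int idx = PickFromGroup(dirs, *group);
  return idx >= 0 ? idx : GrowGroup(dirs, group);
}
```

With a group of three full directories and free directories elsewhere on the server, GetDirForBlock would grow the group to four members instead of returning an error. Whether the group should shrink again later, and how the change would be persisted, is left open here.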