Details
- Improvement
- Status: Open
- Major
- Resolution: Unresolved
- 1.7.0
Description
It'd be nice to do a basic sanity check when starting a tablet copy session. Presently, when a tablet is created, it acquires a data dir group that avoids dirs that were full at the time of the tablet's creation. That's good, but we should also get some info from the remote about how much WAL data, metadata, and data is going to be sent, and check that the copy is possible, assuming no change to disk space usage across data dirs or to the size of the source tablet. In other words, make sure the amount of WAL data and metadata to be copied is less than the available free space for WAL and metadata, and that the amount of data to be copied is less than the space available across the dir group. If the check fails, the new replica should be failed, which will encourage Kudu to re-replicate the tablet elsewhere.
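To make the proposed check concrete, here is a rough sketch of the comparison. None of these types or functions exist in Kudu; `CopySizeEstimate`, `LocalFreeSpace`, and `CheckCopyFits` are hypothetical names made up for illustration. The sketch also assumes the WAL and metadata live on separate volumes; if they share a directory, their sizes would need to be summed before comparing.

```cpp
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical: sizes the source replica would report before the copy starts.
struct CopySizeEstimate {
  uint64_t wal_bytes;
  uint64_t metadata_bytes;
  uint64_t data_bytes;
};

// Hypothetical: snapshot of free space on the destination server.
struct LocalFreeSpace {
  uint64_t wal_dir_free_bytes;
  uint64_t metadata_dir_free_bytes;
  // Free bytes in each data dir of the new replica's dir group.
  std::vector<uint64_t> data_dir_group_free_bytes;
};

// Returns an empty string if the copy looks feasible, otherwise a reason
// that can be logged and used to fail the new replica.
std::string CheckCopyFits(const CopySizeEstimate& est,
                          const LocalFreeSpace& free_space) {
  if (est.wal_bytes >= free_space.wal_dir_free_bytes) {
    return "not enough free space in WAL dir";
  }
  if (est.metadata_bytes >= free_space.metadata_dir_free_bytes) {
    return "not enough free space in metadata dir";
  }
  // Data blocks may be spread over the whole dir group, so compare
  // against the total free space across the group.
  const uint64_t group_free =
      std::accumulate(free_space.data_dir_group_free_bytes.begin(),
                      free_space.data_dir_group_free_bytes.end(),
                      uint64_t{0});
  if (est.data_bytes >= group_free) {
    return "not enough free space across data dir group";
  }
  return "";
}
```

The check is a point-in-time comparison only, so it inherits the caveats about space changing while the copy is in flight.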
Naturally, this isn't perfect, as more space may be used, or more space may be freed, over the course of the copy; also, the source tablet replica may gain additional WAL data to copy as it accepts writes. But it should help, and in particular should help prevent "domino" crashes where one server's WAL dir fills, so it crashes, and re-replication crashes other servers as their WAL drives fill (presumably because they are on similar hardware and have handled a similar workload).
A harder thing to address will be the corner case where the only option is to try to copy to a server with too little space. In this case it'd be better to surface the error aggressively in logs, etc., and perhaps back off on attempts, rather than endlessly create a tablet, start a copy, and fail the sanity check.
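One possible shape for the back-off is a capped exponential delay keyed on the number of consecutive failed sanity checks. This is only a sketch: `NextCopyAttemptDelay` and its default values (30 second base, one hour cap) are hypothetical and not taken from Kudu.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Hypothetical helper: how long to wait before retrying a tablet copy to
// the same server after `consecutive_failures` failed sanity checks.
// The delay doubles with each failure and is capped at `max_delay`.
std::chrono::seconds NextCopyAttemptDelay(
    uint32_t consecutive_failures,
    std::chrono::seconds base = std::chrono::seconds(30),
    std::chrono::seconds max_delay = std::chrono::seconds(3600)) {
  if (consecutive_failures == 0) {
    return std::chrono::seconds(0);  // No failures yet: retry immediately.
  }
  // Clamp the exponent so the shift cannot overflow.
  const uint32_t shift = std::min<uint32_t>(consecutive_failures - 1, 20);
  const std::chrono::seconds delay = base * (int64_t{1} << shift);
  return std::min(delay, max_delay);
}
```

With the defaults, the first failure waits 30 seconds, the second 60, the third 120, and so on up to the one hour cap.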
Issue Links
- relates to KUDU-2404 Mitigate effects of full disks (Open)