Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
1.2.0
None
Description
Currently, a failure during tablet copy may leave behind any of the following on disk:
- Downloaded superblock (if the failure falls after TabletCopyClient::Start())
- Downloaded data blocks (if the failure falls during TabletCopyClient::FetchAll())
- Downloaded WAL segments (if the failure falls during TabletCopyClient::FetchAll())
- Downloaded cmeta file (if the failure falls during TabletCopyClient::Finish())
The next time the tserver starts, it'll see that this tablet's state is still TABLET_DATA_COPYING and will tombstone it. That takes care of the superblock, the WAL segments, and the cmeta file (well, it leaves the cmeta file behind as the tombstone, but that's intentional).
Unfortunately, all of the downloaded data blocks are orphaned: the on-disk superblock has no record of the new blocks, so tombstoning doesn't delete them.
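To make the orphaning concrete, here is a minimal sketch of the condition described above: a block is orphaned when it exists on disk but the superblock does not reference it. This is illustrative only and is not Kudu code. `BlockId` is a hypothetical integer stand-in and `FindOrphanedBlocks` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical stand-in for a block identifier (not Kudu's own type).
using BlockId = std::uint64_t;

// Returns the blocks that are present on disk but not referenced by the
// on-disk superblock. Tombstoning only deletes what the superblock lists,
// so everything returned here would be left behind.
std::vector<BlockId> FindOrphanedBlocks(const std::set<BlockId>& on_disk,
                                        const std::set<BlockId>& referenced) {
  std::vector<BlockId> orphans;
  // std::set iterates in sorted order, which std::set_difference requires.
  std::set_difference(on_disk.begin(), on_disk.end(),
                      referenced.begin(), referenced.end(),
                      std::back_inserter(orphans));
  return orphans;
}
```

A general-purpose GC would need this kind of comparison across every tablet's metadata, which is part of why it is the harder problem.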
We're already tracking a general-purpose GC mechanism for data blocks in KUDU-829, but I think a separate JIRA describing the problem with tablet copy is useful, if only as a reference for users.
Separately, it may be worth addressing these issues for failures that don't result in tserver crashes, such as intermittent network outages between tservers. A long-lived tserver won't GC for some time, and it'd be nice to reclaim the disk space used by these orphaned objects in the interim. Implementing this kind of "GC" for data blocks is also a lot easier than a general-purpose GC.
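One way the non-crash case could be handled is to remember each block as it is downloaded and delete the recorded blocks if the copy does not complete. The sketch below shows that idea under stated assumptions. It is not the actual fix and not Kudu's API: `DownloadedBlockTracker`, `RecordDownloaded`, `Commit`, and the `delete_block` callback are all hypothetical names.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a block identifier (not Kudu's own type).
using BlockId = std::uint64_t;

// Remembers blocks written during a copy and deletes them on destruction
// unless Commit() was called. The deletion callback is supplied by the caller.
class DownloadedBlockTracker {
 public:
  explicit DownloadedBlockTracker(std::function<void(BlockId)> delete_block)
      : delete_block_(std::move(delete_block)) {}

  ~DownloadedBlockTracker() {
    if (committed_) return;  // Copy finished; the blocks are now referenced.
    for (BlockId id : downloaded_) delete_block_(id);
  }

  DownloadedBlockTracker(const DownloadedBlockTracker&) = delete;
  DownloadedBlockTracker& operator=(const DownloadedBlockTracker&) = delete;

  // Call after each block is durably written.
  void RecordDownloaded(BlockId id) { downloaded_.push_back(id); }

  // Call once the copy has fully succeeded.
  void Commit() { committed_ = true; }

 private:
  std::function<void(BlockId)> delete_block_;
  std::vector<BlockId> downloaded_;
  bool committed_ = false;
};
```

This only helps when the process survives the failure. After a crash the in-memory list is gone, so the crash case would still depend on a persisted record of the blocks or on the general-purpose GC.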