[HBASE-28697] Don't clean bulk load system entries until backup is complete - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.6.0
Fix Version/s: 2.7.0, 3.0.0-beta-2, 2.6.1
Component/s: None
Labels:
- pull-request-available

Description

I've been thinking through the incremental backup order of operations, and I think we delete rows from the bulk loads system table too early and, consequently, make it possible to produce a "successful" incremental backup that is missing bulk loads.

To summarize the steps here, starting in IncrementalTableBackupCilent#execute:

We take an incremental backup of the WALs generated since the last backup
We ensure any bulk loads done since the last backup are appropriately represented in the new backup by going through the system table and copying the appropriate files to the backup directory
We delete all of the system table rows which told us about these bulk loads
We generate a backup manifest and mark the backup as complete

If we began deleting any of the system table rows regarding bulk loads, but fail in steps 3 and 4 before we are able to mark the backup as complete, then we'll be in a precarious spot. If we retry an incremental backup then it may succeed, but it would not know to persist the bulk loaded files for which we have already deleted system table references.

We could consider this issue an extension or replacement of https://issues.apache.org/jira/browse/HBASE-28084 in some ways, depending on what solution we land on. I think that we could fix this specific issue by reordering the bulk load table cleanup, but there will always be gotchas like this. Maybe it is simpler to require that the next backup be a full backup after any incremental failure.