With this design, incremental and full backups can't run in parallel, which degrades RPOs whenever a full backup takes a long time, especially for large tables.
Expectation: Say you have a large 10 TB table, your RPO is 60 minutes, and you are allowed to ship the remote backup at 800 Mbps. You are allowed to take a full backup once a week, and all other backups should be incremental.
Shortcoming: With the above design, backups can't run in parallel. While a full backup is running (roughly 28 hours for 10 TB at 800 Mbps), no incremental backup can be taken, which breaches the RPO.
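As a back-of-the-envelope check of the full backup duration, here is a minimal sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits/s) and a fully utilized link with no compression or protocol overhead. The function name is illustrative only.

```python
def transfer_hours(size_tb: float, link_mbps: float) -> float:
    """Hours needed to ship size_tb terabytes over a link_mbps link.

    Assumes decimal units and 100% link utilization (an idealization).
    """
    size_bits = size_tb * 1e12 * 8   # terabytes -> bits
    link_bps = link_mbps * 1e6       # megabits/s -> bits/s
    return size_bits / link_bps / 3600


full_backup_hours = transfer_hours(10, 800)
print(f"full backup takes about {full_backup_hours:.1f} hours")  # about 27.8 hours
```

Either way, the full backup runs for more than a day, so a 60-minute RPO cannot be met while incremental backups are blocked behind it.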
Proposed Solution: Barring some critical sections, such as updating the backup state in the meta tables, everything else can happen in parallel. Incremental backups should be able to run on top of an older successful full or incremental backup, and the completion time of a backup, rather than its start time, should be used for ordering. I have not worked out the full redesign, and will do so if this proposal seems acceptable to the community.
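The idea of ordering by completion time can be sketched roughly as follows. This is only an illustration of the proposal, not the existing HBase backup code: `BackupRecord`, `latest_completed`, and the timestamps are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BackupRecord:
    """Hypothetical record of one backup session."""
    backup_id: str
    kind: str                    # "FULL" or "INCREMENTAL"
    start_ts: int
    complete_ts: Optional[int]   # None while running (or if it failed)


def latest_completed(history: List[BackupRecord], now_ts: int) -> Optional[BackupRecord]:
    """Pick the base for a new incremental backup.

    Only backups that have already completed are candidates, and they are
    ordered by completion time rather than start time.
    """
    done = [b for b in history
            if b.complete_ts is not None and b.complete_ts <= now_ts]
    return max(done, key=lambda b: b.complete_ts, default=None)


history = [
    BackupRecord("full_1", "FULL", start_ts=0, complete_ts=100),
    BackupRecord("inc_1", "INCREMENTAL", start_ts=110, complete_ts=120),
    BackupRecord("full_2", "FULL", start_ts=130, complete_ts=None),  # still running
]

base = latest_completed(history, now_ts=200)
print(base.backup_id)  # inc_1
```

In this sketch the long-running `full_2` does not block a new incremental backup, which simply builds on `inc_1`. Updating `complete_ts` in the shared history would be one of the critical sections that still needs to be serialized.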
Allowing only one backup at a time breaks down easily in a multi-tenant system. This poses the following problems:
- Admins will not be able to achieve the required RPOs for their tables because they depend on the other tenants in the system: a tenant has no control over other tenants' table sizes, and hence over the duration of their backups.
- The management overhead of setting up the right sequence to achieve the required RPOs for different tenants could be very high.
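To illustrate the dependence between tenants, here is a small sketch of a strictly serial backup queue. The tenant names and durations are made up for the example.

```python
from typing import Dict, List, Tuple


def serial_finish_times(queue: List[Tuple[str, int]]) -> Dict[str, int]:
    """Finish time (in minutes) of each tenant's backup when only one
    backup may run at a time, in the given queue order."""
    clock = 0
    finish = {}
    for tenant, duration_minutes in queue:
        clock += duration_minutes
        finish[tenant] = clock
    return finish


# Hypothetical workload: tenant_a runs a day-long full backup,
# tenant_b only needs a 10-minute incremental backup.
queue = [("tenant_a", 1440), ("tenant_b", 10)]
finish = serial_finish_times(queue)

rpo_minutes = 60
breached = {tenant: done > rpo_minutes for tenant, done in finish.items()}
print(finish)    # {'tenant_a': 1440, 'tenant_b': 1450}
print(breached)  # {'tenant_a': True, 'tenant_b': True}
```

Even though tenant_b's own backup takes 10 minutes, it completes 1450 minutes after being queued, purely because of tenant_a's table size.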
Proposed Solution: Same as the previous proposal.
Incremental backup works on WALs, and org.apache.hadoop.hbase.backup.master.BackupLogCleaner ensures that WALs are never cleaned up until the next backup (full or incremental) is taken. This poses the following problem:
- WALs can grow unbounded when there are transient problems, such as the backup site facing issues, until the next scheduled backup succeeds.
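The growth can be pictured with a deliberately simplified model. It is not the actual BackupLogCleaner logic: it assumes one WAL is rolled per hour, one backup is attempted per hour, and a successful backup releases every retained WAL.

```python
from typing import List, Set


def retained_wal_counts(hours: int, failed_hours: Set[int]) -> List[int]:
    """Number of WALs still retained at the end of each hour
    (simplified model, see assumptions above)."""
    retained = 0
    counts = []
    for hour in range(hours):
        retained += 1                  # a new WAL is rolled this hour
        if hour not in failed_hours:   # a successful backup releases all WALs
            retained = 0
        counts.append(retained)
    return counts


# Backups fail during hours 2 to 5, e.g. because the backup site is unreachable.
print(retained_wal_counts(8, failed_hours={2, 3, 4, 5}))
# [0, 0, 1, 2, 3, 4, 0, 0]
```

The retained count keeps climbing for as long as backups fail, and only drops once a backup succeeds again.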
Proposed Solution: I can't think of anything better, but I see this as a potential problem. Also, one can force a full backup if the required WAL files are missing for any reason, not only the ones mentioned above.