HBase / HBASE-25784

Support for Parallel Backups enabling multi tenancy with rsgroups


Details

    • Type: Umbrella
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • None
    • None
    • Component/s: backup&restore

    Description

      Existing Design

       

      Problem 1: 
      With this design, incremental and full backups can't run in parallel, which degrades RPO when a full backup takes a long time, especially for large tables.
       
      Example: 
      Expectation: Say you have a big 10 TB table, your RPO is 60 minutes, and you are allowed to ship the remote backup at 800 Mbps. You are allowed to take a full backup once a week, and all other backups should be incremental.
       
      Shortcoming: With the above design, backups can't run in parallel. While a full backup is running (which takes roughly 25 hours), no incremental backup can be taken, and that breaches your RPO.
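A back-of-the-envelope check of the example above, as a minimal Python sketch. It assumes decimal units (10 TB = 10 * 10^12 bytes), a link fully dedicated to the backup, and no compression or protocol overhead; none of these assumptions come from the issue itself. Under them the estimate lands in the same range as the roughly 25 hours quoted, and shows how many 60-minute RPO windows pass while the full backup blocks incrementals.

```python
def transfer_hours(size_bytes: float, link_bits_per_second: float) -> float:
    """Hours needed to ship size_bytes over a link with the given raw speed."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 3600


# Figures from the example; the unit conventions are assumptions.
TABLE_BYTES = 10 * 10**12   # 10 TB in decimal units
LINK_BPS = 800 * 10**6      # 800 Mbps, fully available to the backup
RPO_HOURS = 1.0             # 60-minute RPO

full_backup_hours = transfer_hours(TABLE_BYTES, LINK_BPS)
# Number of whole RPO windows that elapse while incrementals are blocked.
missed_windows = int(full_backup_hours // RPO_HOURS)

print(f"full backup: {full_backup_hours:.1f} h, "
      f"RPO windows elapsed meanwhile: {missed_windows}")
```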
       
      Proposed Solution: Barring some critical sections, such as modifying the state of the backup in the meta tables, everything else can happen in parallel. Incremental backups should be able to run on top of older successful full or incremental backups, and the completion time of a backup should be used instead of its start time for ordering. I have not worked on the full redesign, and will do so if this proposal seems acceptable to the community.
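One way to picture the ordering change is the sketch below. It is only an illustration of "use completion time, and base incrementals on older successful backups"; `BackupRecord` and `pick_base` are hypothetical names, not part of the HBase backup code, and the real redesign may look quite different.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BackupRecord:
    """Hypothetical, simplified view of one backup's metadata."""
    backup_id: str
    kind: str                    # "FULL" or "INCREMENTAL"
    start_ts: int
    complete_ts: Optional[int]   # None while running, or if it failed


def pick_base(history: List[BackupRecord], now_ts: int) -> Optional[BackupRecord]:
    """Return the most recently *completed* backup as of now_ts, or None."""
    done = [b for b in history
            if b.complete_ts is not None and b.complete_ts <= now_ts]
    return max(done, key=lambda b: b.complete_ts, default=None)


history = [
    BackupRecord("full-1", "FULL", start_ts=0, complete_ts=100),
    BackupRecord("incr-1", "INCREMENTAL", start_ts=110, complete_ts=120),
    BackupRecord("full-2", "FULL", start_ts=130, complete_ts=None),  # still running
]

# A new incremental started at t=200 builds on incr-1, not on the
# in-flight full-2, even though full-2 has the latest start time.
base = pick_base(history, now_ts=200)
print(base.backup_id if base else "no completed backup; a full backup is needed")
```

Ordering by start time would put the unfinished full-2 last in the chain; ordering by completion time keeps it out of the chain until it has actually succeeded.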
       
      Problem 2:
      With only one backup running at a time, the design breaks down easily in a multi-tenant system. This poses the following problems:

      • Admins will not be able to achieve the required RPOs for their tables because they depend on the other tenants in the system: one tenant has no control over the other tenants' table sizes, and hence over the duration of their backups.
      • The management overhead of setting up the right backup sequence to achieve the required RPOs for different tenants could be very high.
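The dependence between tenants can be made concrete with a small sketch. The tenant names and durations are made up for illustration, and the model is deliberately crude: with one backup at a time, a tenant's next backup may have to wait for every other tenant's backup to finish first.

```python
from typing import Dict


def worst_case_gap_serial(durations_hours: Dict[str, float], tenant: str) -> float:
    """Worst-case hours until the tenant's backup completes when backups
    run one at a time and all other tenants happen to go first."""
    others = sum(d for name, d in durations_hours.items() if name != tenant)
    return others + durations_hours[tenant]


def worst_case_gap_parallel(durations_hours: Dict[str, float], tenant: str) -> float:
    """With independent parallel backups, only the tenant's own duration matters."""
    return durations_hours[tenant]


# Hypothetical tenants: one large full backup, two short incrementals.
durations = {"tenant_a": 25.0, "tenant_b": 0.25, "tenant_c": 0.5}

serial_gap = worst_case_gap_serial(durations, "tenant_b")
parallel_gap = worst_case_gap_parallel(durations, "tenant_b")
print(f"tenant_b worst case: serial {serial_gap} h, parallel {parallel_gap} h")
```

In this toy model tenant_b's 15-minute incremental can be delayed by more than a day purely because of tenant_a's table size, which tenant_b cannot influence.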

      Proposed Solution: Same as the previous proposal.
       
      Problem 3: 
      Incremental backup works on WALs, and org.apache.hadoop.hbase.backup.master.BackupLogCleaner ensures that WALs are never cleaned up until the next backup (full or incremental) is taken. This poses the following problem:

      • WALs can grow unbounded when there are transient problems, such as the backup site facing issues, until the next scheduled backup succeeds.
        Proposed Solution: I can't think of anything better, but I see this as a potential problem. Also, one can force a full backup if the required WAL files are missing for whatever reason, not necessarily the ones mentioned above.
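The growth pattern can be sketched as follows. This is a simplified model of the retention rule described above, not the actual BackupLogCleaner logic; the function name and the hourly timestamps are assumptions made for the example.

```python
from typing import List


def retained_wals(wal_timestamps: List[int], last_successful_backup_ts: int) -> List[int]:
    """WALs written after the last successful backup must be kept,
    because the next incremental backup still needs them."""
    return [ts for ts in wal_timestamps if ts > last_successful_backup_ts]


# Assume one WAL is rolled per hour, hours 1..10.
wals = list(range(1, 11))

# The last backup succeeded at hour 2; every later attempt failed
# (for example, the backup site was unreachable).
retained_after_failures = retained_wals(wals, last_successful_backup_ts=2)

# Once a backup finally succeeds at hour 10, nothing needs to be retained.
retained_after_recovery = retained_wals(wals, last_successful_backup_ts=10)

print(len(retained_after_failures), len(retained_after_recovery))
```

As long as no backup succeeds, the retained set only grows with each rolled WAL, which is the unbounded growth described in the bullet above.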
         

      Proposed Design.

      Attachments

        1. proposed_design.png
          136 kB
          Mallikarjun
        2. existing_design.png
          135 kB
          Mallikarjun


          People

            rda3mon Mallikarjun
            rda3mon Mallikarjun
