<< For '...incremental backups at the Stage 1 (RBU) level', won't the time between steps b and d be 'large', so that during the copy the list of files could change on you? I.e., when you go to copy a file, it may have been removed because it was compacted. What do you do in this case? (Your list may not include the compacted file.) >>
We look for the deleted files in .Trash and reclaim them. If they are not there, we fail the backup for that region. The backup job runs in loops: the first loop starts with all regions and outputs the regions that failed, the second loop works only on those failed regions, and so on. The number of loops is configurable; the default is 5.
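The retry loop described above can be sketched roughly as follows. This is only an illustration, not the actual backup job: `run_backup_loops` and the `backup_region` callback are hypothetical names, and the callback stands in for the per-region copy (including the .Trash lookup), returning True on success.

```python
from typing import Callable, Iterable, List, Tuple


def run_backup_loops(
    regions: Iterable[str],
    backup_region: Callable[[str], bool],  # hypothetical: True if the region was backed up
    max_loops: int = 5,                    # mirrors the default of 5 mentioned above
) -> Tuple[List[str], List[str]]:
    """Retry failed regions for up to max_loops passes.

    Returns (regions backed up, regions still failing after the last loop).
    """
    pending = list(regions)  # the first loop starts out with all regions
    done: List[str] = []
    for _ in range(max_loops):
        if not pending:
            break
        failed: List[str] = []
        for region in pending:
            if backup_region(region):
                done.append(region)
            else:
                failed.append(region)
        pending = failed  # the next loop works only on the failed regions
    return done, pending
```

Any regions still in the second list after the last loop would have to be reported as failed for this backup run.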
<< For "a.The backups rely on the clocks across the various region-servers for determining the point in time to which the edits are re-played", so, say a server is lagging the others by a good bit? When replaying the edits, you'd replay edits from when this lagging server said the backup began? >>
No. Right now, to keep things simple, we just subtract a configurable amount of time (say 5 minutes) from the start time of the MR job. We could certainly do what you describe as an enhancement.
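As a small illustration of that arithmetic (the function and parameter names are made up for this sketch, and the 5-minute margin is just the example value from above):

```python
from datetime import datetime, timedelta


def replay_start(job_start: datetime,
                 skew_margin: timedelta = timedelta(minutes=5)) -> datetime:
    """Point in time from which edits are replayed.

    The margin is subtracted from the MR job's start time so that a
    region server whose clock lags by less than the margin is still covered.
    """
    return job_start - skew_margin
```

A server lagging by more than the configured margin would not be covered by this approach, which is where the enhancement suggested in the question would help.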
<< How will you know which hlogs to replay? You'll open it and look at first and last edits in the file? Or should we write out metadata files for hlogs? Or is it enough relying on hdfs modtime? >>
The hlog files are named hlog.TIMESTAMP, where TIMESTAMP is the time at which the log was created. We use this timestamp to determine the file set: we need all files where TIMESTAMP > start time and TIMESTAMP < finish time, plus the latest file where TIMESTAMP < start time, since that file may still contain edits written after the start time.
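A minimal sketch of that selection rule, assuming the file names are plain strings of the form hlog.TIMESTAMP with an integer timestamp (`select_hlogs` is a hypothetical helper, not part of the backup code). It follows the rule exactly as stated, so a file whose timestamp equals the start or finish time is not selected.

```python
from typing import Iterable, List


def select_hlogs(names: Iterable[str], start: int, finish: int) -> List[str]:
    """Pick the hlog files needed to replay edits between start and finish."""
    stamped = []
    for name in names:
        prefix, _, ts = name.rpartition(".")
        if prefix == "hlog" and ts.isdigit():  # skip anything not named hlog.TIMESTAMP
            stamped.append((int(ts), name))
    stamped.sort()  # oldest first

    # All files created strictly inside the backup window.
    in_window = [name for ts, name in stamped if start < ts < finish]
    # The latest file created before the window opened (at most one).
    before = [name for ts, name in stamped if ts < start]
    return before[-1:] + in_window
```

For example, with files hlog.100 through hlog.500 in steps of 100, a start time of 250 and a finish time of 450, this picks hlog.200, hlog.300 and hlog.400.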