The serial numbering of the files solution requires that checkpoints occur only at a edits split boundaries.
Yes, but since we can split edits at will, I don't think there's any problem just having the backupnode asking the active NN to roll whenver the BN would like to do a checkpoint. The nice thing about this is that an image file from the BN can be lined up exactly with the corresponding edit logs from the NN, etc.
The transaction ID one does not have that restriction but it does require that in order to detect a gap in edits one has to look inside the logs. The txId one can avoid that if we are prepared to rename the edits log when you split (roll) it (Ugh!)
Agreed re ugh! The renaming is the complexity we're trying to avoid, no?
The txId numbering scheme also has the advantage that multiple backups can roll and do checkpoints independently (we DONOT want to do that as it will confuse the operators â“ but it shows that the design is very robust.
I still think this is possible with sequential numbering. And I agree that not confusing operators is a key design goal for this JIRA. The whole image/edit log thing in normal operation should be an implementation detail, and when operators have to look at it they're usually very stressed out because a cluster is corrupt - so we want to make it very clear what's going on, and very hard to create any state that is unrecoverable.
I've started working on this patch and it's coming along nicely. The NN and secondary NN are working great, and just started on the BN/Checkpointer. Here's a brief overview of the design I'm going with - hopefully I will answer the above questions along the way.
The NN storage directories continue to be organized in the same way - either edits, images, or both. The difference is that each edits or fsimage file now has a suffix indicating its "roll index". For example, a newly formatted NN has the following contents:
- fsimage_0 - empty image
- edits_0_inprogress - the edit log currently being appended
When edits are rolled, the current 'edits_N_inprogress' file is "finalized" by renaming to simply edits_N. So, if we roll the edits of the above image, we end up with:
- fsimage_0 - same empty image
- edits_0 - any edits made before the roll
When an image is saved or uploaded via a checkpoint, the validity rule is as follows: any fsimage with roll index N must incorporate all edits from logs with a roll index less than N. So, if we enter safe mode and call saveNamespace on the above example, we end up with:
- fsimage_0 - original empty imagge
- edits_0 - edits before first roll
- edits_1 - edits before saveNamespace
- fsimage_2 - all edits from edits_0 and edits_1
- edits_2_inprogress - the edit log where new edits will be appended
Log Rolling Triggers
The following events can trigger a log roll:
- NN startup (see below)
- a secondary or backup node wants to begin a checkpoint
- an IOException has occurred on one of the current edit logs
- potentially we may find it useful to expose this as an admin function? (eg mysql offers a flush logs; command)
Log rolling behavior:
- The current edits_N_inprogress log is closed
- The current edits_N_inprogress log is renamed to edits_N in all valid edits directories.
- Any edits directories that previously had problems will be left with edits_N_inprogress (since we don't know whether all of the edits made it into that log before the roll, in fact they probably did not)
- The next edits_N+1_inprogress is opened in all directories, including an attempt to reopen any failed directories.
First we initiate log recovery:
- Across all edits directories, look for any edits_N_inprogress:
- If one is found, look for a finalized edits_N file in any other log directory
- If there is at least one finalized edits_N, then the edits_N_inprogress is likely corrupt – rename it to edits_N_corrupt (or delete it if we are less cautious)
- If there are no finalized edits_N files, then the NN crashed while we were writing log index N. Initiate recovery process across all edits_N_inprogress:
- Currently this isn't fancy - I just pick one. However, we could scan each of the logs for OP_INVALID and find the longest one, ensure that they have the same length, etc (eg one log must not have caught the last edit, or been truncated, etc)
- This is very simple to do since across all directories (including secondaries) edits_M for any M should be identical!
- After we've determined the correct log(s), finalize it and remove the others
Next, find the fsimage_N with the highest N across all image directories.
Then, find the edits_M with the highest M across all edits directories.
For safety, we check that there exists an edits_X for all X between N and M inclusive.
We then start up the NN by the following sequence:
- load fsimage_N
- for each M through N inclusive, load edits_N
- if we loaded any edits, save fsimage_N+1
- open edits_inprogress_N+1
- Checkpoint Signature is modified to include the latest image index and the current log index in progress.
- Checkpointing node issues beginCheckpoint to NN
- NN rolls edit logs, and returns a checkpoint signature that includes the latest stored fsimage_N, as well as the index of the log it just rolled to
- Image transfer servlet is augmented to allow the downloader to specify which image or edits file to download
- Checkpointer downloads fsimage_N and edits_N through edits_M (where M is the new finalized edit log from the roll)
- Checkpointer saves local fsimage_M+1, and uploads to NN
- NN validation of the checkpoint signature is much simpler - just needs to make sure it came from the same filesystem, check any security tokens, etc. The old fstime and editstime constructs are no longer necessary since it's all encapsulated in the index numbers. For extra safety we can easily add some checksum or log length info to the CheckpointSignature
- NN saves fsimage_M+1 into its local image dirs, but does not need to do any log manipulation.
I'm still working out the backupnode operation, but I think it will actually be simplified by this proposal. Rather than having a special journaling mode, I think the NN can simply push any log roll events through the edit log stream to the BN. This will keep the roll indexes (and log contents) on the BN exactly identical to the indexes on the NN, which has good operational advantages and also reduces code complexity in the BN.
Handling multiple checkpointers
Note that in the above process there is no state stored on the NN with regard to ongoing checkpoint processes. If multiple checkpoint nodes checkpoint simultaneously, the NN will simply roll twice and hand a different index to each. Each will then upload fsimages with different indexes.
Image/edits file retention policies
There are a number of policies that should be simple to implement:
- Number of saved images - ensure that we have at least N saved images in our image directories, can delete any that are more than N versions old. Maintain edit lots that have index >= the index of the Nth oldest image.
- Time - ensure that we maintain all images within a trailing time window - again maintain all edit logs with index >= index of oldest maintained image.
- Archival - for audit purposes, the deletion mechanism could very easily be augmented to archive the edit logs for later analysis (eg to HDFS, tape, SAN, etc)
So long as any fsimage_N and all edits_M where M >= N are retained somewhere, they can be copied back into the NN's storage directories and full PITR is possible.