|
[
Permlink
| « Hide
]
alan wootton added a comment - 26/May/06 08:46 AM
Does anyone have a clever idea of how to take a snapshot of a running, very active, and very large, NameNode? Without things timing out like crazy? We're ok with the idea of just stopping everything once in a while and restarting the namenode. Still....
It looks as if this is on a back burner. I would like to raise the issue again, because the edits file can become large very quickly depending on the frequency of file operations. I generate an edits file of more than 2.5Gb every 24 hours. Of course, the namenode server could be restarted periodically (this what I have to do right now), but this is rather interruptive to clients. What about rotating the edits file periodically and resolving the older edits file with the image file in a separate thread?
Proposal for Checkpointing the Namesystem State in DFS
------------------------------------------------------ Currently, the namesystem state in memory consists of a tree While the namenode is in operation, the edits log keeps on growing, Therefore it is necessary to periodically checkpoint the current state Image File Format on Disk The image on the disk consists of a header followed by a list of Edits Log File Format on Disk The edits log contains a list of namespace transactions. Each transaction 1. Add File : Full path of the file, replication factor, list of blocks When should we Checkpoint ? Checkpointing decision could be based on elapsed time (e.g. every hour) How should we Checkpoint ? There are a number of choices here. We describe each choice, and its pros 1. Lock the entire namespace in the main namenode thread, while we save This would disable all namenode operations while we are checkpointing 2. While saving an INode (i.e. path), only lock those nodes in the tree This would require extensive changes in the simple locking model used 3. Lock the entire namespace while we make an in-memory copy of the This would certainly be faster that option 1, since it does not involve 4. Lock the namespace. Rename the edits log. Start a new edits log. This method suffers from the same problem of requiring double the virtual 5. Lock the namespace. Rename the edits log. Start a new edits log. One observation that makes this proposal attractive is that the current Each entry in the on-disk image corresponds to a path. The only If we store the image where entries are sorted by their 'path' field, Almost all the path-related transactions in the edits log correspond to One approach to handle the directory rename operation while periodically The edits log would typically be a few Megabytes at the time of A rename operation "rename srcPath dstPath" in the edits log can be rename srcPath dstPath operation can be removed, and rename srcPath dstPath can be appended at the end of the edits log, if we apply the following rename srcPath tmpPath This way, we can manipulate the edits log, so that all the rename We change the definition of the on-disk image, so that it consists not With this modification for handling directory-renames, our periodic Id = On disk image 1. Load Ed into memory as a list of edits entries The namenode startup procedure is also modified, to be: 1. Load Id, form a root directory tree. A checkpointing approach you don't seem to evaluate is making it very cheap to clone the in-memory tree, specifically, by always copying nodes between the root and each edit. That way one can checkpoint by just grabbing the pointer to the root, and then writing the shadow tree in the background. No merging required, no complex re-sorting of operations, etc.
Well, the sort-merge approach does not seem performant anyway if the number of renames is large. So, I am adding the copy-on-write (Dhruba's comment on
Complex rename relocation stuff could be avoided if we used (unique) files ids to identify files.
In this case file name is just an attribute of the file. Renaming does not change the file id. File hierarchy is based on ids rather than file names. And if we need to sort, we sort by file ids rather than their names. I like the merging approach. It is simple in general (not in details though) and does not Proposal for Copy-On-Write FileSystem Tree For Periodic Checkpointing
We propose that the hadoop namenode image be checkpointed to disk after The checkpointing method we propose: 1. Does not introduce extensive changes in the simple locking model This proposal is based on making the filesystem tree copy-on-write 1. Close the transaction log 'edits.N', where N is the current This checkpointing thread: 1. Acquires global namesystem lock. Step 6 operation will become clear, when we describe how the namenode Namenode server threads always acquire the global namesystem lock 1. Check if checkpointingInProgress is false. Step 6 of the checkpointing node consists of traversing the With this checkpointing scheme, the namenode startup procedure remains Minor correction:
In the comment above, instead of Step 6 of the checkpointing node consists of Please read: Step 6 of the checkpointing thread consists of The design looks pretty simple and clean.
I still like the merging approach better. It is stand-alone! There is no need to change anything in the name-node code. It is useful as a maintenance utility for merging edits and images externally. Does not lock name-node. At some point the name-node data structures should be revised substantially and this copy-on-write effort will most probably be a wasted effort. Does it make sense to invest more effort in designing a simpler merge algorithm? If we still choose to do that:
Merging of fsimage with the edits can be done using O(sqrt( number of files )) memory.
Suppose the number of files in fsimage (sorted by path name) is N. If we need to tighten the memory requirement then we can divide N into smaller number Sounds a lot like a BTree and comes with all of the issues. Lots of IO and complexity. Reimplementing that seems like a bad idea. perhaps you can find a good java BTree, but this seems like a big, heavy piece of code.
Why do we need to do this? Here is a patch on the current Hadoop trunk .
This patch do automatic checkpoints without locking the filesystem. When it is time to do a checkpoint, edit logs stream are closed and new edit logs are opened, a thread is created that create a fake FSNamesystem that will merge previously written logs into fsimage. At the end, new edit logs are renamed to their old names. It will consume as much memory during the chekpointing as the current running instance of the FSNamesystem. The auto checkpointing feature is disabled by default. So applying the patch "as is" is almost safe. (It does not break current image and logs format and loading philosophy) Nonetheless, I can understand that you, the Hadoop dev team, does not want to integrate this huge hacky patch as a part of the hadoop distribution... Right patch with a right name for the unit test case
Ouch ! I'm tired today sorry, here is the right patch !
Thanks Philippe, this is a refined effort that uses just the existing code to upload the image and merge it with the edits.
Unfortunately, it doubles the memory consumption during checkpointing, which is what this issue all about imo. > Sounds a lot like a BTree and comes with all of the issues. > Why do we need to do this? The copy-on-write approach potentially leads to a linear memory increase and requires additional name-node data structures. I was trying to come up with a simpler algorithm for the stand-alone checkpointing. The Backup Namenode Proposal
-------------------------------------------- The idea is to create a backup namenode, download the fsimage and the edits file to the backup namenode, merge them into a single image and then upload the newly created image into the primary namenode. This approach has the following advantages: The namenode when invoked with the "-backupmode" command line option functions as the backup namenode. No additional scripts needed. One can run the backup namenode and the primary namenode on the same physical machine. The backup namenode downloads the fsimage and the edits from the primary namenode through a http-get message. The primary namenode rolls the edit file on disk, send starts logging new transactions into the new editlog file. The backup namenode merges the downloaded fsimage and edit into a new image file. It then uploads the new image file to the primary namenode. The primary namenode replaces the old fsimage and the old editlog with the new uploaded fsimage. I like this proposal. It is simple, clean, reliable.
We need a backup namenode anyway for a production deployment. I like this proposal too
Much cleaner that my hacky patch ! Here is a much detailed writeup on the Backup NameNode proposal. "Secondary NameNode" and "Backup NameNode" refer to the same node in this writeup. Please review and comment.
Configuration The configuration file will have a the following new definitions:
The Secondary NameNode will use "org.apache.hadoop.dfs.NameNode.Alternate" property to log its debug and informational messages. Primary NameNode
The NameNode will have two additional servlets:
The Primary NameNode at startup time deletes fsImage.tmp (if it exists). The NameNode loads the fsImage, then loads the edits and then loads edits.1. Then it writes the merged fsImage, deletes edits and edits.1. Secondary NameNode The Secondary NameNode issues the rollEditLog() RPC to instruct the Primary NameNode to start logging edits into edits.1. The Secondary NameNode then uses the getFile servlet to fetch the contents of fsImage and edits. It puts them in the fs.checkpoint.dir and, reads them into memory, merges them and writes it back to fsImage.tmp. The Secondary NameNode than uploads the fsImage.tmp file to the Primary NameNode using the putFsImage servlet. Once the above steps are successful, the Secondary NameNode issues the rollFsImage() RPC. A checkpoint is complete when this RPC completes successfully. If any of the RPC calls returns an error, the Secondary NameNode discards all processing that it might have done, logs an error message, and waits for the normal trigger to start the next checkpoint. Issues 2. The fact that rollFsImage() fails if either edits or edits.1 are non-existent means that the system is protected against spurious checkpoint if the NameNode restarts when the Secondary NameNode was doing a merge. This check can be made more explicit by returning a cookie with the rollEditLog() command and enforcing that rollFsImage() supplies the same cookie. (The Primary NameNode resets the cookie if it restarts). > * fs.checkpoint.period : Time (in seconds) between two checkpoints.
This is the maximal time between the checkpoints, right? We should introduce new NamenodeProtocol for primary to secondary name-node communication. I'd go in the other direction: the primary node checks the edits log size each time it adds an entry. I think we should think about supporting multiple secondary nodes at least at the design stage. I agree that we can introduce a new Protocol called the SecondaryNamenodeProcotol.
I would still persist with the proposal that the primary namenode is just a "slave" as far as periodic checkpointing is concerned. All the "intelligence" of when to create the checkpoint, how to create it, etc.etc remains with the SecondaryNamenode. In the case when we support multiple Secondary namenodes doing their own periodic checkpointing according to their own schedules, the primary Namenode would otherwise have to do lots of schedule management for each of these periodic-checkpointers. An addition to the above proposal. There will be an additional parameter named dfs.namenode.secondary.configfile that contains the absolute pathname of the "masters" file.
The namenode will allow SecondaryNamenodeProcotol connections only from nodes listed in the "masters' file. Also, the webUI can query the namenode to list the identities of the Secondary Namenodes. My idea of supporting multiple secondary nodes was that the primary node always deals with ONE secondary node, which in turn
becomes the "primary" node for next secondary node, and so on. The order of the nodes is defined by how the secondary nodes are listed in the config file. That way each name-node need to know and speak to only one secondary, which substantially simplifies the logic. The primary decides when the new check point should be created and initiates the chain of checkpoints. I think we want to avoid heartbeat processing from secondary nodes and minimize inter-name-node communication traffic. I'd prefer to have all configuration in config file rather than configurable paths to other files, containing edditional configuration parameters. Konstantin proposal is essentially a push model where the primary Namenode drives the scehduling policies of the periodic checkpointing. Also, he mentioned about supporting cascading secondaries.
I am going ahead the pull model: the Namenode is a very passive entity as far as periodic checkpointing is concerned. The scheduling policies are maintained only by the secondary namenode. The secondary namenode polls the primary periodically (say every 5 minutes) to determine the size of the current edit log. The secondary would use HTTP-GET method to transfer fsmage and edits. Al alternative that was discussed was to use HDFS itself to transfer the image. However using HDFS has the disadvantage that the secondary would have to poll the primary to determine when the upload to HDFS was complete (HDFS does not have streaming RPC and has a fixed timeout for an RPC). First draft of periodic chekpointing code. Code review comments appreciated.
The second draft of the periodic checkpointing code. This includes a unit test case to test checkpointing as part of nightly test run.
Comments on Periodic Checkpointing Patch - v2
---------------------------------------- fs.checkpoint.period should be in seconds, not milliseconds. I am trying to come up with the default values for the following configurable parameters:
1. The size of the edit log that can cause the next checkpoint. The next periodic checkpoint occurs whenever at least one of the above conditions are met. Assuming that a transaction takes 200 bytes in the edits log and the rate of 100 transactions per second, the edit log will increase at the rate of about 70MB per hour. Thus I am proposing that the default values for periodic checkpoints be Comments appreciated. 64KB seems aggressive, it only allows about 300 transactions before a checkpoint happens. Checkpointing should be frequent enough that when the namenode restarts it should be able to merge the edits into the namespace in a fairly short period of time (10-15 seconds perhaps?).
It might be useful to monitor how many edits can be merged into the image in 10 seconds (on a 2Ghz CPU say) and use something around that as the default. Sorry, a typo. My earlier calculations should have resulted in:
1. edit log size = 64MB (not KB) My theory is that a checkpoint time is actually dominated by the time to read and write the image to disk and the time to transfer the the image back and forth between the namenode and the secondary namenode. The actual CPU time needed to do the merge of the edit log with the fsimage may be relatively small.
My back-of-the-envelope calculation shows that sequential disk IO rates are at best 2GB per minute. Assuming a 2 GB image: 1. NameNode reads image = 1min Thus a total of 3 to 4 minutes can be used to do a checkpoint (ignoring network issues). Is it possible to eleminate 1. and 2. above? Since secondary holds the new image from the last checkpoint, at the start of checkpoint secondary can exchange a token with the primary to confirm its image is the latest and merge with that image. Yes, it is possible to optimize. But it will not be part of my current implementation.
This incorporates most review comments.
The main changes in this round of changes were in the following areas: 1. Handle error cases when one namedir is bad. regd 64MB default size, It need not be tailored so that it takes one hour to reach the limit.. there is already other variable to do checkpoints every one hour. As Sameer pointed out its size could be determined by how it affects NameNode start up time. Since namenode start time depends more on the image size, edits log size could be larger. A small value could have more burden on NameNodes that are already busy. In a big deployment, default many not matter at all since it will be manually set. Are you saying that the default editlogsize should be 128MB or so?
How much time does it take to merge 512MB? I think as long as time to merge edit.log is in the order of 30-60 sec, that should be fine. That way if a node is lightly loaded it will checkpoint every hour and edit.log would still be much smaller and if it busy, we won't add extra checkpointing load. Another thing, to implement 'image file token' to avoid steps 1. and 2. above, we don't need to store any extra state on disk. This would just be a runtime property. Every time primary gets a new image from secondary, it also exchanges a 'token/etag'. If primary restarts, first check point will have steps 1. and 2. The namenode is typically bottlenecked on CPU whereas a new checkpoint-upload is IO bound. A periodic checkpoint acquires the fsnamesystem lock and switches/renames files. The overhead of doing this operation once every 360000 namenode transactions (in our hour) should be minimal. I would like to make 64MB the default editlogsize and once deployed, I will observe a real life cluster and then determine if we need to change the default size.
Reagrding the optimization of 'image token': I would defer it till I see measureable difference in performance on a real cluster. My goal is to keep the periodic checkpoint "as-simple-as-possible" while achieving the goal of making the "namenode-restart faster". Yes, the optimization should be/should be done later. This has been merged with the latest trunk. Please review.
This is the final patch for periodic checkpointing. Review comments from Milind that were incorporated here were:
[Minor] hadoop-config.sh added export command wrongly indented [Reconsider Design] Why do we prevent periodic checkpointing if the namenode is in safemode ? Periodic checkpointing does not alter namespace, so should be allowed in safemode. [Minor] Clearly mark ErrorSimulation functions in SecondaryNamenode.java as used only from the testcase, so as to avoid confusion while reading code. Please incorporate periodiccheckpoint6.patch. Thanks.
+1, because http://issues.apache.org/jira/secure/attachment/12349479/periodiccheckpoint6.patch
bin/slaves.sh caused secondary namenodes to start on all datanodes (only if HADOOP_SLAVES defined in hadoop-env.sh)
+1
I've run sort, smalljobs, nnbench, and dfsio benchmarks with this patch. All pass. I just committed this. Thanks, Dhruba!
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||