Well, what I mean by the parent thread holding the lock is the following:
the saveNamespace method is synchronized on the FSNamesystem, and currently, while holding this lock, the handler thread walks the tree N times and writes N files; so in a way we assume that the tree is guarded from all modifications by the FSNamesystem lock.
The same is true for the patch, except that in this case the tree is walked by N different threads. They operate under the same assumption: while we hold the FSNamesystem lock the tree is not being modified, and the handler thread waits for all worker threads to finish writing to their files before returning from the section synchronized on FSNamesystem.
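To make the threading concrete, here is a minimal sketch of that scheme. All names here (`ParallelSave`, `saveImage`, the byte-array "image") are hypothetical stand-ins, not the actual patch: the real code walks the namespace tree and serializes it per storage directory. The point is the structure: one writer thread per destination, spawned and joined while the caller still holds the lock.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ParallelSave {
    // Stand-in for the per-directory image save; the real patch walks the
    // in-memory namespace tree here, we just write bytes.
    static void saveImage(Path file, byte[] image) throws IOException {
        Files.write(file, image);
    }

    // Mirrors the patch's idea: the caller holds the lock for the whole
    // duration, one thread per storage directory, join before returning.
    public static synchronized void saveNamespace(List<Path> dirs, byte[] image)
            throws InterruptedException {
        List<Thread> workers = new ArrayList<>();
        for (Path dir : dirs) {
            Thread t = new Thread(() -> {
                try {
                    saveImage(dir.resolve("fsimage"), image);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            t.start();
            workers.add(t);
        }
        for (Thread t : workers) {
            t.join(); // do not release the lock until every copy is on disk
        }
    }

    public static void main(String[] args) throws Exception {
        List<Path> dirs = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            dirs.add(Files.createTempDirectory("name" + i));
        }
        saveNamespace(dirs, "namespace".getBytes());
        for (Path d : dirs) {
            System.out.println(Files.exists(d.resolve("fsimage")));
        }
    }
}
```

Because the workers are joined inside the synchronized method, the tree-is-frozen assumption is exactly the one the current serial code already relies on.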
We just deployed this patch internally to our production cluster:
2010-06-22 10:12:59,714 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 11906663754 saved in 140 seconds.
2010-06-22 10:13:50,626 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 11906663754 saved in 191 seconds.
This saved us about 140 seconds on the current image: the two saves overlap, so the total time is the slower of the two (191 seconds) rather than their sum (roughly 331 seconds).
As far as both copies being on the same drive are concerned, I guess this patch will not give much of an improvement.
However, I am not sure there is much value in storing two copies of the image on the same drive in the first place.
Please correct me if I am wrong, but I thought that multiple copies of the image should be stored on different drives to help in case of drive failure (or on a filer to protect against the machine dying). Storing two copies on the same drive only helps with file corruption or accidental deletion, and that is a weak argument for keeping multiple copies on one physical drive.
I like your approach with one thread doing serialization and others doing writes, but it seems a lot more complicated than the one in this patch.
I am simply executing one call in a newly spawned thread, while the serializer-writer approach raises more implementation questions, such as what to do with writers that consume their queues at different speeds. You cannot grow the queues indefinitely, since the namenode would simply run out of memory; on the other hand, you want to feed the faster consumers as quickly as possible.
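The pacing question can be seen in a small sketch of that alternative. This is my own illustration, not code from any patch: one serializer thread produces each chunk once and hands it to every writer through a bounded `ArrayBlockingQueue`, so the slowest writer stalls the serializer through back-pressure instead of the queue growing without limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class SerializerWriterSketch {
    static final byte[] POISON = new byte[0]; // end-of-stream marker

    public static void main(String[] args) throws Exception {
        final int numWriters = 2, numChunks = 100;
        List<BlockingQueue<byte[]>> queues = new ArrayList<>();
        List<Thread> writers = new ArrayList<>();
        final AtomicLong bytesWritten = new AtomicLong();

        for (int i = 0; i < numWriters; i++) {
            // Bounded capacity caps memory use: put() blocks instead of
            // letting a slow writer's backlog exhaust the heap.
            BlockingQueue<byte[]> q = new ArrayBlockingQueue<>(8);
            queues.add(q);
            final long delayMs = i * 5; // second writer is deliberately slower
            Thread w = new Thread(() -> {
                try {
                    while (true) {
                        byte[] chunk = q.take();
                        if (chunk == POISON) break;
                        Thread.sleep(delayMs); // simulate disk speed
                        bytesWritten.addAndGet(chunk.length);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            w.start();
            writers.add(w);
        }

        // Single serialization pass: each chunk is produced once and handed
        // to every writer; put() blocks on the slowest writer's full queue.
        for (int c = 0; c < numChunks; c++) {
            byte[] chunk = new byte[16];
            for (BlockingQueue<byte[]> q : queues) q.put(chunk);
        }
        for (BlockingQueue<byte[]> q : queues) q.put(POISON);
        for (Thread w : writers) w.join();

        System.out.println(bytesWritten.get()); // 2 writers * 100 chunks * 16 bytes
    }
}
```

Note that with blocking back-pressure the whole pipeline runs at the pace of the slowest disk, which is the trade-off I was alluding to: simplicity of bounded memory versus letting fast consumers finish early.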
And the main benefit I see is serializing the tree only once; but since we are holding the FSNamesystem lock at that time, the NameNode is not doing much else anyway, and it is also no worse than what was in place before (serialization took place once per image location).