Tsz Wo (Nicholas), SZE added a comment - 05/May/08 10:16 PM - edited
Lease Recovery Algorithm
Now that
3310_20080514.patch: Implementing the Lease Recovery Algorithm
Should we start a new thread to recover blocks in the DataNode (i.e. the case DatanodeProtocol.DNA_RECOVERBLOCK)?
The processing of DNA_RECOVERBLOCK would entail making RPCs to other datanode(s), right? This should be done in a thread that is separate from the offerService thread.
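A minimal sketch of the idea above, assuming a simplified stand-in for the datanode: the command handler hands the recovery work to a separate thread and returns immediately, so the loop that received the command is not blocked by RPCs to other datanodes. The class `RecoverySketch` and its methods are hypothetical and are not taken from the patch.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for the datanode; not the real DataNode class.
class RecoverySketch {
    // Records which block ids were recovered, for illustration only.
    final List<Long> recovered = new CopyOnWriteArrayList<>();

    // Placeholder for the slow part: in the real system this would
    // involve RPCs to the other datanodes holding replicas.
    void recoverBlock(long blockId) {
        recovered.add(blockId);
    }

    // Called from the command-processing loop. It starts a worker thread
    // and returns at once, so the caller is never blocked by recovery.
    Thread handleRecoverCommand(long[] blockIds) {
        long[] copy = blockIds.clone();
        Thread worker = new Thread(() -> {
            for (long id : copy) {
                recoverBlock(id);
            }
        }, "block-recovery");
        worker.setDaemon(true);
        worker.start();
        return worker;
    }

    public static void main(String[] args) throws InterruptedException {
        RecoverySketch dn = new RecoverySketch();
        Thread worker = dn.handleRecoverCommand(new long[] {1L, 2L});
        worker.join(); // only the demo waits; the command loop would not
        System.out.println("recovered blocks: " + dn.recovered);
    }
}
```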
3310_20080516b.patch: my latest code
Question: When a block is being updated (i.e. its generation stamp and block length are being changed), what happens if a reader tries to read the block? I guess the reader should get an exception. However, how do we tell whether a block is being updated?
3310_20080516c.patch: cleaned up some code
Every read request carries the block id and the generation stamp. If the datanode cannot find the block (because the generation stamp has changed), it returns an exception.
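As a rough illustration of that check, here is a small sketch in which a replica table is looked up by block id and the requested generation stamp is compared with the stored one. The class `StampCheckSketch`, its methods, and the exception type are assumptions made for this example and do not reflect the actual datanode code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a datanode's replica table; names are illustrative.
class StampCheckSketch {
    // block id -> generation stamp of the replica currently stored
    private final Map<Long, Long> stored = new HashMap<>();

    void store(long blockId, long generationStamp) {
        stored.put(blockId, generationStamp);
    }

    // A read names both the id and the stamp. A missing block or a
    // stamp mismatch is reported to the reader as an exception.
    String read(long blockId, long generationStamp) {
        Long current = stored.get(blockId);
        if (current == null || current != generationStamp) {
            throw new IllegalStateException(
                "block " + blockId + " with stamp " + generationStamp + " not found");
        }
        return "data of block " + blockId;
    }

    public static void main(String[] args) {
        StampCheckSketch dn = new StampCheckSketch();
        dn.store(42L, 1L);
        System.out.println(dn.read(42L, 1L));
        dn.store(42L, 2L); // the block was updated to a new stamp
        try {
            dn.read(42L, 1L); // a reader still holding the old stamp
        } catch (IllegalStateException e) {
            System.out.println("reader got: " + e.getMessage());
        }
    }
}
```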
The current code also behaves as follows: when a client gets an exception, it retries the other replicas. If all of these replicas fail, it goes back to the namenode to re-retrieve the block locations. At that point, it should get the correct generation stamp of the block. The client then retries the read request to the datanode, and this one should succeed. Do you think that this will work?
> how to tell whether a block is being updated?
When updating a block, the meta file is first renamed to a tmp file. After the update is done, the tmp file is renamed to the new meta file (with the new generation stamp). It should work.
3310_20080519.patch: a complete version for review. Still needs more tests.
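The two-step rename of the meta file described in this thread could look roughly like the sketch below. The file naming scheme (`blk_<id>_<stamp>.meta`), the `.tmp` suffix, and the class `MetaRenameSketch` are assumptions for illustration; the actual patch may name and handle the files differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; not the FSDataset code from the patch.
class MetaRenameSketch {
    // Assumed naming scheme for illustration: blk_<id>_<stamp>.meta
    static Path metaFile(Path dir, long blockId, long stamp) {
        return dir.resolve("blk_" + blockId + "_" + stamp + ".meta");
    }

    // Renames the old meta file to a tmp file, then renames the tmp file
    // to the meta file that carries the new generation stamp. While the
    // tmp file exists, no meta file matches either stamp.
    static Path updateStamp(Path dir, long blockId, long oldStamp, long newStamp) {
        try {
            Path oldMeta = metaFile(dir, blockId, oldStamp);
            Path tmp = dir.resolve(oldMeta.getFileName() + ".tmp");
            Files.move(oldMeta, tmp);
            // The block length update would happen here.
            Path newMeta = metaFile(dir, blockId, newStamp);
            Files.move(tmp, newMeta);
            return newMeta;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("meta-sketch");
        Files.createFile(metaFile(dir, 42L, 1L));
        Path updated = updateStamp(dir, 42L, 1L, 2L);
        System.out.println("meta file is now " + updated.getFileName());
    }
}
```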
3310_20080519b.patch: added a test and a few append methods in ClientProtocol, NameNode, FSNamesystem for testing. Still needs more tests.
One comment: The primary datanode makes an RPC call to the secondary datanode(s) to stamp the generationStamp for a block. As part of processing this request, the secondary datanode(s) should first terminate any threads that are currently writing to that block before returning "success" to this RPC. The threads that are currently writing to a block can be found in FSDataset.ActiveFile.threads.
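A minimal sketch of that ordering, assuming writers respond to interruption: the stamping call interrupts every thread writing to the block, waits for each of them to exit, and only then applies the new generation stamp and reports success. The class `StopWritersSketch` and its fields are hypothetical; they only stand in for the thread list mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the writer list stands in for the threads
// tracked per active file on the datanode.
class StopWritersSketch {
    final List<Thread> writers = new ArrayList<>();
    long generationStamp;

    // Starts a thread that pretends to write until it is interrupted.
    Thread startWriter() {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(10);
                }
            } catch (InterruptedException e) {
                // Interrupted by stampBlock: stop writing and exit.
            }
        });
        t.setDaemon(true);
        writers.add(t);
        t.start();
        return t;
    }

    // Returns true ("success") only after every writer has exited.
    boolean stampBlock(long newStamp) {
        for (Thread t : writers) {
            t.interrupt();
        }
        for (Thread t : writers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        writers.clear();
        generationStamp = newStamp;
        return true;
    }

    public static void main(String[] args) {
        StopWritersSketch block = new StopWritersSketch();
        Thread w = block.startWriter();
        boolean ok = block.stampBlock(2L);
        System.out.println("stamped=" + ok + ", writer alive=" + w.isAlive());
    }
}
```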
3310_20080520.patch: Thanks, Dhruba.
3310_20080521.patch: improved javadoc
I see a compilation error while using the latest patch.
[javac] /export/home/dhruba/snow/src/test/org/apache/hadoop/dfs/TestFileCreation.java:42: cannot find symbol
Other issues that came to my mind:
1. I am making changes to the DFSClient. When the DFSClient encounters an error in the pipeline, it eliminates the bad node from the pipeline and needs to stamp all known good replicas with the new generation stamp. The DFSClient will invoke LeaseManager.recoverBlock. This method makes two RPC calls to the namenode: getNextGenerationStamp and commitBlockSynchronization. These two methods are part of the DatanodeProtocol. The problem is that when this is invoked by the DFSClient, these two RPCs should also be available thru the ClientProtocol. Can this be arranged?
2. internalReleaseLease invokes lease.renew(). Instead, LeaseManager.removeExpiredLease() should invoke lease.renew(). The reason is that a lease actually corresponds to multiple files.
3. removeExpiredLease is also invoked from startFileInternal. In this case, only one file in the lease should be recovered. The current code recovers all the files in the lease.
> I see a compilation error while using the latest patch.
Yeah, there is a typo: TestFileCreation2 => TestFileCreation
> ... these two RPCs should also be available thru the ClientProtocol. ...
But ClientProtocol is for client-namenode communication. I think we need a new RPC recoverBlock(...) in either ClientProtocol or a new client-datanode protocol.
> 2. internalReleaseLease invokes lease.renew(). Instead, LeaseManager.removeExpiredLease() should invoke lease.renew(). The reason being that a lease actually corresponds to multiple files.
> 3. removeExpiredLease is also invoked from startFileInternal. In this case, only one file in the lease should be recovered. The current code recovers all the files in the lease.
Then removeExpiredLease is not useful anymore, since its uses differ between startFileInternal and LeaseManager.Monitor.run(). I will remove it and fix the callers' code individually.
3310_20080522b.patch:
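To make the distinction in point 3 concrete, here is a small sketch of a lease that covers several files, with one operation that recovers a single file (the file-creation case) and one that recovers everything (the expired-lease case). The class `LeaseSketch` and its method names are hypothetical and are not the LeaseManager API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of one lease holder with several open files.
class LeaseSketch {
    final Set<String> paths = new TreeSet<>();

    void add(String path) {
        paths.add(path);
    }

    // File-creation case: recover only the file being re-created and
    // leave the other files under this lease untouched.
    List<String> recoverOne(String path) {
        if (paths.remove(path)) {
            return Collections.singletonList(path);
        }
        return Collections.emptyList();
    }

    // Expired-lease case: recover every file the lease covers.
    List<String> recoverAll() {
        List<String> all = new ArrayList<>(paths);
        paths.clear();
        return all;
    }

    public static void main(String[] args) {
        LeaseSketch lease = new LeaseSketch();
        lease.add("/a");
        lease.add("/b");
        lease.add("/c");
        System.out.println("recovered one: " + lease.recoverOne("/b"));
        System.out.println("recovered rest: " + lease.recoverAll());
    }
}
```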
3310_20080522c.patch: updated javadoc
I tried to test the patch for lease expiry. It does not work yet since we still write block to a tmp file first. FSDataset.validateBlockFile() will fail during lease recovery.
Also, FSDataset.volumeMap should use the block id as its key, instead of the block (which compares both id and generation stamp), since the generation stamp may not be known.
3310_20080523.patch: latest code, but it fails on TestFileCreation.testFileCreationNamenodeRestart()
Hi Nicholas, I took your latest patch and made changes to it so that the same lease recovery code is called from the client. It passes all unit tests except TestFileCreation. Maybe we can use this patch for further development and debugging. Also, please feel free to make any changes to the code I added.
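The difference between the two keying choices discussed in this thread can be shown with a small sketch. The `Block` class here is a hypothetical simplification whose equality covers both the id and the generation stamp; it is not the Hadoop class, and the maps only stand in for volumeMap.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of keying a replica map by block versus by block id.
class BlockKeySketch {
    static final class Block {
        final long id;
        final long stamp;

        Block(long id, long stamp) {
            this.id = id;
            this.stamp = stamp;
        }

        // Equality compares both the id and the generation stamp.
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Block)) {
                return false;
            }
            Block b = (Block) o;
            return id == b.id && stamp == b.stamp;
        }

        @Override
        public int hashCode() {
            return Objects.hash(id, stamp);
        }
    }

    final Map<Block, String> byBlock = new HashMap<>();
    final Map<Long, String> byId = new HashMap<>();

    void put(Block b, String location) {
        byBlock.put(b, location);
        byId.put(b.id, location);
    }

    public static void main(String[] args) {
        BlockKeySketch maps = new BlockKeySketch();
        maps.put(new Block(42L, 7L), "/data/current/blk_42");
        // The caller knows the id but not the current stamp.
        Block guess = new Block(42L, 0L);
        System.out.println("keyed by block: " + maps.byBlock.get(guess));
        System.out.println("keyed by id:    " + maps.byId.get(guess.id));
    }
}
```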
Hi Nicholas, the more I think of this, the more it sounds logical to make FSDataset.updateBlock work correctly if the block is either in the volumeMap or in the ongoingCreates.
Even when "append" is supported, it makes sense to keep the blocks that are currently being written to in the tmpdir. This ensures that a block report will not report these blocks. It also ensures that the periodic block scanner will not operate on these blocks. It is also an indirect persistent representation of blocks that need recovery if the datanode restarts. Can this be done?
Hi Dhruba, I will fix TestFileCreation and figure out how to update a tmp file. Thank you for your comments.
The failing TestFileCreation may be caused by
3310_20080527.patch:
3310_20080528.patch: passes all tests in TestFileCreation (with a get-around of
3310_20080528b.patch: fixed a bug; it passed all tests on my machine.
3310_20080528c.patch: fixed a problem when the last block is empty.
Passed all tests locally. Try hudson.
This patch invokes the lease recovery code from the dfs client. It passes all unit tests.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12383040/3310_20080529b.patch against trunk revision 661462.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 18 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2521/testReport/
This message is automatically generated.
3310_20080529c.patch: fixed findbugs warning.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12383056/3310_20080529c.patch against trunk revision 661771.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 18 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2523/testReport/
This message is automatically generated.
The failed TestIndexedSort is not related to this issue, see
If this patch passes random-writer/sort on a reasonable size cluster (e.g. 500 nodes), it will be ready for "commit".
I did not have the resources to do a 500 node run. Here are the results for a 100 node run. Please let me know whether that works.
Sort on 100 nodes with trunk: time in minutes
Sort on 100 nodes with trunk + patch: time in minutes
Ok, 100 nodes sound good. I will commit it.
I just committed it. Thanks Nicholas!
Integrated in Hadoop-trunk #511 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/511/