Issue Details (XML | Word | Printable)

Key: HADOOP-3310
Type: New Feature New Feature
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Tsz Wo (Nicholas), SZE
Reporter: Tsz Wo (Nicholas), SZE
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Lease recovery for append

Created: 24/Apr/08 11:43 PM   Updated: 08/Jul/09 04:43 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: 0.18.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 3310_20080514.patch 2008-05-14 05:25 PM Tsz Wo (Nicholas), SZE 38 kB
Text File Licensed for inclusion in ASF works 3310_20080516b.patch 2008-05-16 10:38 PM Tsz Wo (Nicholas), SZE 49 kB
Text File Licensed for inclusion in ASF works 3310_20080516c.patch 2008-05-16 10:55 PM Tsz Wo (Nicholas), SZE 47 kB
Text File Licensed for inclusion in ASF works 3310_20080519.patch 2008-05-19 08:49 PM Tsz Wo (Nicholas), SZE 51 kB
Text File Licensed for inclusion in ASF works 3310_20080519b.patch 2008-05-20 12:52 AM Tsz Wo (Nicholas), SZE 62 kB
Text File Licensed for inclusion in ASF works 3310_20080520.patch 2008-05-21 01:21 AM Tsz Wo (Nicholas), SZE 76 kB
Text File Licensed for inclusion in ASF works 3310_20080521.patch 2008-05-21 07:19 PM Tsz Wo (Nicholas), SZE 78 kB
Text File Licensed for inclusion in ASF works 3310_20080522b.patch 2008-05-22 09:20 PM Tsz Wo (Nicholas), SZE 81 kB
Text File Licensed for inclusion in ASF works 3310_20080522c.patch 2008-05-22 10:13 PM Tsz Wo (Nicholas), SZE 82 kB
Text File Licensed for inclusion in ASF works 3310_20080523.patch 2008-05-24 02:15 AM Tsz Wo (Nicholas), SZE 85 kB
Text File Licensed for inclusion in ASF works 3310_20080524_dhruba.patch 2008-05-24 09:22 AM dhruba borthakur 88 kB
Text File Licensed for inclusion in ASF works 3310_20080527.patch 2008-05-28 01:10 AM Tsz Wo (Nicholas), SZE 100 kB
Text File Licensed for inclusion in ASF works 3310_20080528.patch 2008-05-28 08:21 PM Tsz Wo (Nicholas), SZE 102 kB
Text File Licensed for inclusion in ASF works 3310_20080528b.patch 2008-05-28 09:54 PM Tsz Wo (Nicholas), SZE 101 kB
Text File Licensed for inclusion in ASF works 3310_20080528c.patch 2008-05-29 12:50 AM Tsz Wo (Nicholas), SZE 104 kB
Text File Licensed for inclusion in ASF works 3310_20080529.patch 2008-05-29 08:00 AM dhruba borthakur 96 kB
Text File Licensed for inclusion in ASF works 3310_20080529b.patch 2008-05-29 06:19 PM dhruba borthakur 105 kB
Text File Licensed for inclusion in ASF works 3310_20080529c.patch 2008-05-29 10:07 PM Tsz Wo (Nicholas), SZE 108 kB
Issue Links:
Blocker
Dependants
 

Hadoop Flags: Reviewed, Incompatible change
Release Note: Implemented Lease Recovery to sync the last bock of a file. Added ClientDatanodeProtocol for client trigging block recovery. Changed DatanodeProtocol to support block synchronization. Changed InterDatanodeProtocol to support block update.
Resolution Date: 02/Jun/08 06:46 PM


 Description  « Hide
In order to support file append, a GenerationStamp is associated with each block. Lease recovery will be performed when there is a possibility that the replicas of a block in a lease may have different GenerationStamp values.

For more details, see the documentation in HADOOP-1700.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Tsz Wo (Nicholas), SZE added a comment - 05/May/08 10:16 PM - edited
Lease Recovery Algorithm
/*
   * 1) Namenode retrieves lease information
   * 2) For each file f in the lease, consider the last block b of f
   * 2.1) Get the datanodes which contains b
   * 2.2) Assign one of the datanodes as the primary datanode p

   * 2.3) p obtains a new generation stamp form the namenode
   * 2.4) p get the block info from each datanode
   * 2.5) p computes the minimum block length
   * 2.6) p updates the datanodes, which have a valid generation stamp,
   *      with the new generation stamp and the minimum block length 
   * 2.7) p acknowledges the namenode the update results

   * 2.8) Namenode updates the BlockInfo
   * 2.9) Namenode removes f from the lease
   *      and removes the lease once all files have been removed
   * 2.10) Namenode commit changes to edit log
   */

dhruba borthakur added a comment - 14/May/08 07:07 AM
Now that HADOOP-2656 has been committed, this issue is the next in line that is required for appends. Nicholas: if you are working on this this, feel free to upload a very early version of the patch so that we can review it earlier. Thanks.

Tsz Wo (Nicholas), SZE added a comment - 14/May/08 05:25 PM
3310_20080514.patch: Implementing the Lease Recovery Algorithm

Tsz Wo (Nicholas), SZE added a comment - 15/May/08 09:09 PM
Should we start a new thread to recover block in DataNode (i.e. the case DatanodeProtocol.DNA_RECOVERBLOCK)?

dhruba borthakur added a comment - 16/May/08 04:53 AM
The processing of DNA_RECOVERBLOCK would entail making RPCs to other datanode(s), right? This should be done in a thread that is separate from the offerService thread.

Tsz Wo (Nicholas), SZE added a comment - 16/May/08 10:38 PM
3310_20080516b.patch: my latest codes

Question: When updating a block (i.e. updating generation stamp and block length), what happens if a reader tries to read the block? I guess the reader should get an exception. However, how to tells whether a block is being updated?


Tsz Wo (Nicholas), SZE added a comment - 16/May/08 10:55 PM
3310_20080516c.patch: cleanup some codes

dhruba borthakur added a comment - 19/May/08 05:34 PM
every read request has the blockid and the generation stamp. if the datanode cannot find the block (because the generation stamp has changed), then it will return an exception.

The current code also behaves as folows: When a client gets an exception, it retries other replicas. If all these replicas fail, then it goes back to the namenode to re-retrieve block locations. Now, it should get the correct generation stamp of the block. Then, the client will retry the read request to the datanode and this one shud succeed.

do you think that this will work?


Tsz Wo (Nicholas), SZE added a comment - 19/May/08 05:53 PM
>how to tells whether a block is being updated?

When updating a block, the meta file is renamed to a tmp file at first. After the update is done, the tmp file will be renamed to the new meta file (with the new generation stamp.) It should work.


Tsz Wo (Nicholas), SZE added a comment - 19/May/08 08:49 PM
3310_20080519.patch: a completed version for reviewing. Still need more tests.

Tsz Wo (Nicholas), SZE added a comment - 20/May/08 12:52 AM
3310_20080519b.patch: added a test and a few append methods in ClientProtocol, NameNode, FSNamesystem for testing. Still need more tests.

dhruba borthakur added a comment - 20/May/08 09:14 PM
One comment: The primary datanode makes an RPC call to the secondary datanode(s) to stamp the generationStamp for a block. As part of processing this request, the secondary datanode(s) should first terminate any threads that are currently writing to that block before returning "success" to this RPC. The threads that are currently writing to a block can be found in FSDataset.ActiveFile.threads.

Tsz Wo (Nicholas), SZE added a comment - 21/May/08 01:21 AM
3310_20080520.patch: Thanks, Dhruba.
  • added FSDataset.interruptOngoingCreates which is invoked in FSDataset.updateBlock.
  • The patch passed all unit tests

Tsz Wo (Nicholas), SZE added a comment - 21/May/08 07:19 PM
3310_20080521.patch: improved javadoc

dhruba borthakur added a comment - 22/May/08 07:24 AM
I see a compilation error while using the latest patch.

javac] /export/home/dhruba/snow/src/test/org/apache/hadoop/dfs/TestFileCreation.java:42: cannot find symbol
[javac] symbol : class TestFileCreation2
[javac] location: class org.apache.hadoop.dfs.TestFileCreation
[javac] static final String DIR = "/" + TestFileCreation2.class.getSimpleName() + "/";


dhruba borthakur added a comment - 22/May/08 08:05 AM
Other issues that came to my mind:

1. I am making changes to the DFDClient. When the DFSClient encounters an error in the pipeline, it eliminates the bad node from the pipeline and needs to stamp all known good replicas with the new generation stamp. The DFSClient will invoke LeaseManager.recoverBlock. This method make a two RPC calls to the namenode : getNextGenerationStamp and commitBlockSynchronization. These two methods are part of the DataodeProtocol. The problem is that when this is invoked by the DFSClient, these two RPCs should also be available thru the ClientProtocol. Can this be arranged?

2. internalReleaseLease invokes lease.renew(). Instead, LeaseManager.removeExpiredLease() should invoke lease.renew(). The reason being that a lease actually corresponds to multiple files.

3. removeExpiredLease is also invoked from startFileInternal. In this case, only one file in the lease should be recovered. The current code recovers all the files in the lease.


Tsz Wo (Nicholas), SZE added a comment - 22/May/08 05:49 PM
> I see a compilation error while using the latest patch.

yeah, there is a typo: TestFileCreation2 => TestFileCreation

>... these two RPCs should also be available thru the ClientProtocol. ...

But ClientProtocol is for client-namenode communication. I think we need a new RPC recoverBlock(...) in either ClientProtocol or a new client-datanode protocol.


Tsz Wo (Nicholas), SZE added a comment - 22/May/08 06:06 PM
> 2. internalReleaseLease invokes lease.renew(). Instead, LeaseManager.removeExpiredLease() should invoke lease.renew(). The reason being that a lease actually corresponds to multiple files.
>
> 3. removeExpiredLease is also invoked from startFileInternal. In this case, only one file in the lease should be recovered. The current code recovers all the files in the lease.

Then, removeExpiredLease is not useful anymore since the uses of it are different in startFileInternal and LeaseManager.Monitor.run(). I will remove it and fix the caller's codes individually.


Tsz Wo (Nicholas), SZE added a comment - 22/May/08 09:20 PM
3310_20080522b.patch:
  • fixed the typo, item 2 and item 3 mentioned above
  • added targets in INodeFileUnderConstruction
  • renew lease even if all targets are not avaliable
  • In FSDataset.updateBlock, unpdate oldblock.generationStamp before using it.
  • In LeaseManager.syncBlock, if successList.isEmpty(), don't commit the block.
  • In FSNamesystem.commitBlockSynchronization, don't write to editLog since the finalizeINodeFileUnderConstruction(...) has already done it.

Tsz Wo (Nicholas), SZE added a comment - 22/May/08 10:13 PM
3310_20080522c.patch: updated javadoc

Tsz Wo (Nicholas), SZE added a comment - 23/May/08 08:47 PM
I tried to test the patch for lease expiry. It does not work yet since we still write block to a tmp file first. FSDataset.validateBlockFile() will fail during lease recovery.

Also, FSDataset.volumeMap should use block id as key, instead of block (which compares both id and generation stamp) since the generation stamp may be not known.


Tsz Wo (Nicholas), SZE added a comment - 24/May/08 02:15 AM
3310_20080523.patch: latest codes but it fails on TestFileCreation.testFileCreationNamenodeRestart()

dhruba borthakur added a comment - 24/May/08 09:22 AM
Hi Nicholas, I took your latest patch and made changes to it so that the same lease recovery code is called from the client. It passes all unit tests except TestFileCreation. Maybe we can use this patch for further development and debugging. Also, pl feel free to make any changes to the code I added.

dhruba borthakur added a comment - 25/May/08 06:58 AM
Hi Nicholas, the more I think of this, the more it sounds logical to make FSDataset.updateBlock work correctly if the block is either in the volumeMap or in the ongoingCreates.

Even when "append" is supported, It makes sense to keep the blocks that are currently being written to in the tmpdir. This ensures that a block report will not report these blocks. It also ensures that the periodic block scanner will not operate on these blocks. It is also an indirect persistence representation of blocks that need recovery if the datanode restarts. Can this be done?


Tsz Wo (Nicholas), SZE added a comment - 27/May/08 06:04 PM
Hi Dhruba, I will fix TestFileCreation and figure out how to update a tmp file. Thank you for your comments.

Tsz Wo (Nicholas), SZE added a comment - 27/May/08 10:34 PM
The failing TestFileCreation may be caused by HADOOP-3453.

Tsz Wo (Nicholas), SZE added a comment - 28/May/08 01:10 AM
3310_20080527.patch:
  • FSDataset.updateBlock finds block file in from both volumeMap and ongoingCreates
  • TestFileCreation2 is a temporary test for running testLeaseExpireHardLimit() alone.

Tsz Wo (Nicholas), SZE added a comment - 28/May/08 08:21 PM
3310_20080528.patch: passes all tests in TestFileCreation (with a get-around of HADOOP-3453)

Tsz Wo (Nicholas), SZE added a comment - 28/May/08 09:54 PM
3310_20080528b.patch: fixed a bug and it passed all tests in my machine.

Tsz Wo (Nicholas), SZE added a comment - 29/May/08 12:50 AM
3310_20080528c.patch: fixed a problem when the last block is empty.

Tsz Wo (Nicholas), SZE added a comment - 29/May/08 02:38 AM
Passed all tests locally. Try hudson.

dhruba borthakur added a comment - 29/May/08 08:00 AM
This patch invokes the lease recovery code from the dfs client. It passes all unit tests.

Hadoop QA added a comment - 29/May/08 09:51 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12383040/3310_20080529b.patch
against trunk revision 661462.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 18 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2521/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2521/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2521/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2521/console

This message is automatically generated.


Tsz Wo (Nicholas), SZE added a comment - 29/May/08 10:07 PM
3310_20080529c.patch: fixed findbugs warning.

Hadoop QA added a comment - 30/May/08 06:40 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12383056/3310_20080529c.patch
against trunk revision 661771.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 18 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2523/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2523/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2523/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2523/console

This message is automatically generated.


Tsz Wo (Nicholas), SZE added a comment - 30/May/08 07:02 PM - edited
The failed TestIndexedSort is not related to this issue, see HADOOP-3471.

dhruba borthakur added a comment - 02/Jun/08 06:39 AM
If this patch passes random-writer/sort on a reasonable size cluster (e.g. 500 nodes), it will be ready for "commit".

Mukund Madhugiri added a comment - 02/Jun/08 05:25 PM
I did not have resources to do a 500 node run. Here are the results for a 100 node run. Please let me know that works?

Sort on 100 nodes with trunk: time in minutes

  • Random Writer: 15.96
  • Sort: 52.92
  • Validation: 11.94

Sort on 100 nodes with trunk + patch: time in minutes

  • Random Writer: 14.76
  • Sort: 53.76
  • Validation: 11.46

dhruba borthakur added a comment - 02/Jun/08 06:09 PM
Ok, 100 nodes sound good. I will commit it.

dhruba borthakur added a comment - 02/Jun/08 06:46 PM
I just committed it. Thanks Nicholas!

Hudson added a comment - 03/Jun/08 01:26 PM