
Key: HADOOP-1076
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: dhruba borthakur
Reporter: Konstantin Shvachko
Votes: 0
Watchers: 0
Hadoop Common

Periodic checkpointing cannot resume if the secondary name-node fails.

Created: 07/Mar/07 04:07 AM   Updated: 08/Jul/09 04:42 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.15.0

Time Tracking:
Not Specified

File Attachments:
  secondaryRestart4.patch (14 kB), 2007-09-20 11:18 PM, dhruba borthakur, licensed for inclusion in ASF works

Resolution Date: 24/Sep/07 08:57 PM


Description
If the secondary name-node fails during checkpointing, the primary node is left with two edits files:
"edits" - the file on which the current checkpoint is to be based.
"edits.new" - the file where new name space edits are currently logged.
The problem is that the primary node cannot start a new checkpoint while the "edits.new" file is in place.
That is, even if the secondary name-node is restarted, periodic checkpointing is not resumed.
In fact, the primary node throws an exception complaining about the existing "edits.new".
The only way to get rid of the edits.new file is to restart the primary name-node.
So, in effect, if the secondary name-node fails you have to restart the whole cluster.

Here is a rather simple modification to the current approach, which we discussed with Dhruba.
When the secondary node calls rollEditLog(), the primary node should roll the edit log only if
it has not already been rolled. Otherwise the existing "edits" file will be used for checkpointing
and the primary node will keep accumulating new edits in "edits.new".
For this to work, the primary node should also ignore any rollFSImage() request while it is
already performing one. Otherwise the new image can become corrupted if two secondary
nodes request rollFSImage() at the same time.
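
To make the proposal concrete, here is a minimal sketch of the two guards in a toy in-memory model. It is not the actual patch and not Hadoop code: the class CheckpointSketch, its fields, the boolean return values, and the Runnable parameter of rollFSImage() are all assumptions made for illustration. Only the names rollEditLog(), rollFSImage(), "edits" and "edits.new" come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the primary name-node's checkpoint state (hypothetical, not Hadoop code). */
class CheckpointSketch {
    private final List<String> edits = new ArrayList<>(); // stands in for the "edits" file
    private List<String> editsNew = null;                 // stands in for "edits.new"; null = not rolled
    private boolean imageRollInProgress = false;          // true while a rollFSImage() is running

    /** Logs an edit to "edits.new" if the log has been rolled, otherwise to "edits". */
    public synchronized void logEdit(String op) {
        (editsNew != null ? editsNew : edits).add(op);
    }

    /** Rolls the edit log only if it has not been rolled yet. Returns true if this call rolled it. */
    public synchronized boolean rollEditLog() {
        if (editsNew != null) {
            return false; // already rolled: keep checkpointing from the existing "edits"
        }
        editsNew = new ArrayList<>();
        return true;
    }

    /**
     * Installs a new image and promotes "edits.new" to "edits".
     * The request is ignored (returns false) if the log was never rolled
     * or if another rollFSImage() is already in progress.
     */
    public boolean rollFSImage(Runnable installImage) {
        synchronized (this) {
            if (editsNew == null || imageRollInProgress) {
                return false;
            }
            imageRollInProgress = true;
        }
        try {
            installImage.run(); // slow work happens outside the lock
            synchronized (this) {
                edits.clear();
                edits.addAll(editsNew);
                editsNew = null;
            }
            return true;
        } finally {
            synchronized (this) {
                imageRollInProgress = false; // reset even if installImage throws
            }
        }
    }

    /** Returns a copy of the edits the next checkpoint would be based on. */
    public synchronized List<String> currentEdits() {
        return new ArrayList<>(edits);
    }

    /** Returns true while an "edits.new" exists. */
    public synchronized boolean isRolled() {
        return editsNew != null;
    }
}
```

In this model a restarted secondary node can call rollEditLog() again without an exception: the second call simply returns false and the checkpoint proceeds from the existing "edits". A second rollFSImage() that arrives while the first is still installing the image is dropped rather than allowed to interleave with it.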

2. Also, after the periodic checkpointing patch HADOOP-227 I see pieces of unused code.
I noticed one data member, SecondaryNameNode.localName, and at least 4 methods in FSEditLog
that are not used anywhere. We should remove them, and any others like them that we find.
Maintaining unused code is such a waste of time.



Change History
dhruba borthakur made changes - 18/Jul/07 06:14 PM
Field Original Value New Value
Fix Version/s 0.15.0 [ 12312565 ]
dhruba borthakur made changes - 28/Aug/07 08:59 AM
Attachment secondaryRestart.patch [ 12364671 ]
dhruba borthakur made changes - 28/Aug/07 08:59 AM
Assignee dhruba borthakur [ dhruba ]
dhruba borthakur made changes - 10/Sep/07 11:31 PM
Attachment secondaryRestart2.patch [ 12365517 ]
dhruba borthakur made changes - 10/Sep/07 11:31 PM
Attachment secondaryRestart.patch [ 12364671 ]
dhruba borthakur made changes - 18/Sep/07 10:48 PM
Attachment secondaryRestart3.patch [ 12366136 ]
dhruba borthakur made changes - 18/Sep/07 10:48 PM
Attachment secondaryRestart2.patch [ 12365517 ]
dhruba borthakur made changes - 20/Sep/07 11:18 PM
Attachment secondaryRestart4.patch [ 12366319 ]
dhruba borthakur made changes - 20/Sep/07 11:18 PM
Attachment secondaryRestart3.patch [ 12366136 ]
dhruba borthakur made changes - 24/Sep/07 05:24 PM
Status Open [ 1 ] Patch Available [ 10002 ]
dhruba borthakur made changes - 24/Sep/07 08:57 PM
Resolution Fixed [ 1 ]
Status Patch Available [ 10002 ] Resolved [ 5 ]
Doug Cutting made changes - 05/Nov/07 06:11 PM
Status Resolved [ 5 ] Closed [ 6 ]
Owen O'Malley made changes - 08/Jul/09 04:42 PM
Component/s dfs [ 12310710 ]