
Key: HADOOP-2585
Type: New Feature
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Konstantin Shvachko
Reporter: Konstantin Shvachko
Votes: 0
Watchers: 1
Hadoop Common

Automatic namespace recovery from the secondary image.

Created: 12/Jan/08 12:22 AM   Updated: 08/Jul/09 04:42 PM
Component/s: None
Affects Version/s: 0.16.0
Fix Version/s: 0.18.0

Time Tracking:
Not Specified

File Attachments:
  SecondaryStorage.patch (68 kB, licensed for inclusion in ASF works), 2008-04-11 10:31 AM, Konstantin Shvachko
  SecondaryStorage.patch (68 kB, licensed for inclusion in ASF works), 2008-04-11 02:46 AM, Konstantin Shvachko
  SecondaryStorage.patch (70 kB, licensed for inclusion in ASF works), 2008-04-02 08:19 PM, Konstantin Shvachko
Issue Links:
Incorporates: HADOOP-3069, HADOOP-2987

Hadoop Flags: Reviewed, Incompatible change
Release Note:
Improved management of replicas of the name space image. If all replicas on the Name Node are lost, the latest check point can be loaded from the secondary Name Node. Use parameter "-importCheckpoint" and specify the location with "fs.checkpoint.dir." The directory structure on the secondary Name Node has changed to match the primary Name Node.
Resolution Date: 11/Apr/08 09:19 PM
Labels:


Description
Hadoop has three-way (configuration-controlled) protection against losing the namespace image:
  1. the image can be replicated on different hard drives of the same node;
  2. the image can be replicated on an NFS-mounted drive on an independent node;
  3. a stale replica of the image is created during periodic checkpointing and stored on the secondary name-node.

Currently during startup the name-node examines all configured storage directories, selects the
most up-to-date image, reads it, merges it with the corresponding edits, and writes the new image back
into all storage directories. Everything is done automatically.

If, due to multiple hardware failures, none of those images on mounted hard drives (local or remote)
is available, the secondary image, although stale (up to one hour old by default), can still be
used to recover the majority of the file system data.
Currently one can reconstruct a valid name-node image from the secondary one manually.
It would be nice to support automatic recovery.



Konstantin Shvachko added a comment - 12/Jan/08 12:27 AM
We had a real example of such a failure on one of our clusters.
We were able to reconstruct the namespace image from the secondary node using the following
manual procedure, which might be useful for those who find themselves in the same kind of trouble.

Manual recovery procedure from the secondary image.

  1. Stop the cluster to make sure all data-nodes and *-trackers are down.
  2. Select a node where you will run a new name-node, and set it up as usual for a name-node.
  3. Format the new name-node.
  4. cd <dfs.name.dir>/current
  5. You will see a file named VERSION in there. You need to put the namespaceID of the old cluster in it.
    The old namespaceID can be obtained from one of the data-nodes:
    copy the namespaceID value from <dfs.data.dir>/current/VERSION
  6. rm <dfs.name.dir>/current/fsimage
  7. scp <secondary-node>:<fs.checkpoint.dir>/destimage.tmp ./fsimage
  8. Start the cluster. Starting with an upgrade is recommended, so that you can roll back if something goes wrong.
  9. Run fsck, and remove files with missing blocks, if any.
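
Step 5 of the procedure above can also be scripted. The following is only a minimal sketch, assuming the VERSION files are in java.util.Properties text format and use the key namespaceID; the class NamespaceIdCopy and its methods are hypothetical names for illustration and are not part of Hadoop.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not a Hadoop class.
final class NamespaceIdCopy {

    // Assumption: a VERSION file can be parsed as java.util.Properties.
    static Properties load(Path versionFile) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(versionFile)) {
            props.load(in);
        }
        return props;
    }

    // Returns a copy of the name-node properties whose namespaceID
    // is replaced by the one found in the data-node properties.
    static Properties withNamespaceId(Properties nameNode, Properties dataNode) {
        String oldId = dataNode.getProperty("namespaceID");
        if (oldId == null) {
            throw new IllegalArgumentException("data-node VERSION has no namespaceID");
        }
        Properties result = new Properties();
        result.putAll(nameNode);
        result.setProperty("namespaceID", oldId);
        return result;
    }

    // args[0]: <dfs.data.dir>/current/VERSION, args[1]: <dfs.name.dir>/current/VERSION
    public static void main(String[] args) throws IOException {
        Path dataNodeVersion = Paths.get(args[0]);
        Path nameNodeVersion = Paths.get(args[1]);
        Properties fixed = withNamespaceId(load(nameNodeVersion), load(dataNodeVersion));
        try (OutputStream out = Files.newOutputStream(nameNodeVersion)) {
            fixed.store(out, null);
        }
    }
}
```

Back up the name-node VERSION file before overwriting it; the sketch rewrites the file in place.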

Automatic recovery proposal.

The proposal has 2 parts.

  1. The secondary node should store the latest check-pointed image file in compliance with the
    name-node storage directory structure. It is best if the secondary node uses the Storage class
    (or FSImage if code re-use makes sense here) to maintain the checkpoint directory.
    This should ensure that the checkpointed image is always ready to be read by a name-node
    if the directory is listed in its "dfs.name.dir" list.
  2. The name-node should consider the configuration variable "fs.checkpoint.dir" as a possible
    location of the image available for read-only access during startup.
    This means that if name-node finds all directories listed in "dfs.name.dir" unavailable or
    finds their images corrupted, then it should turn to the "fs.checkpoint.dir" directory
    and try to fetch the image from there. I think this should not be the default behavior but
    rather triggered by a name-node startup option, something like:
    hadoop namenode -fromCheckpoint

    So the name-node can start with the secondary image as long as the secondary node drive is mounted.
    And the name-node will never attempt to write anything to this drive.
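
The fallback in part 2 can be pictured with a short sketch. It is illustrative only: ImageSourceSelector, splitDirs, and chooseImageDir are hypothetical names, the validity check is passed in as a predicate rather than implemented, and the sketch returns the first usable directory instead of comparing images for recency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the proposed startup fallback, not Hadoop code.
final class ImageSourceSelector {

    // Splits a comma-delimited configuration value into trimmed, non-empty entries.
    static List<String> splitDirs(String value) {
        List<String> dirs = new ArrayList<>();
        if (value == null) {
            return dirs;
        }
        for (String part : value.split(",")) {
            String dir = part.trim();
            if (!dir.isEmpty()) {
                dirs.add(dir);
            }
        }
        return dirs;
    }

    // Returns the directory to load the image from, or null if none is usable.
    // nameDirs stands for the "dfs.name.dir" value, checkpointDirs for "fs.checkpoint.dir".
    static String chooseImageDir(String nameDirs, String checkpointDirs,
                                 boolean fromCheckpoint, Predicate<String> hasValidImage) {
        for (String dir : splitDirs(nameDirs)) {
            if (hasValidImage.test(dir)) {
                return dir;
            }
        }
        // Only consult the checkpoint directory when the startup option was given.
        if (fromCheckpoint) {
            for (String dir : splitDirs(checkpointDirs)) {
                if (hasValidImage.test(dir)) {
                    return dir; // treated as a read-only source
                }
            }
        }
        return null;
    }
}
```

Keeping the checkpoint lookup behind an explicit flag mirrors the suggestion that this should not be the default behavior.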

Added bonuses provided by this approach

  • One can choose to restart failed name-node directly on the node where the secondary node ran.
    This brings us a step closer to the hot standby.
  • Replication of the image to NFS can be delegated to the secondary name-node if we
    support multiple entries in "fs.checkpoint.dir". This applies, of course, only if the administrator
    chooses to accept outdated images in order to boost the name-node performance.

Konstantin Shvachko added a comment - 02/Apr/08 08:19 PM
The main features of this patch:
  1. An "-importCheckpoint" argument is introduced for the name-node startup. When started with this argument:
    • The name-node will try to load the image from the directory specified in "fs.checkpoint.dir" and save it to the
      name-node storage directory(s) set in "dfs.name.dir";
    • The name-node will fail if a legal image is contained in "dfs.name.dir";
    • The name-node verifies that the image in "fs.checkpoint.dir" is consistent, but does not modify it in any way.
  2. The secondary node storage directory structure is standardized to match the primary node directory structure.
    As a consequence:
    • we get protection from accidentally starting multiple secondary nodes in the same directory;
    • we get protection from starting the primary and the secondary in the same directory;
    • the primary can be started directly with the checkpointed storage if necessary.
  3. The checkpoint directory now contains the 2 latest images: "current" has the most recent image, and
    "previous.checkpoint" has the previous one, which (as I understand it) was proposed in HADOOP-2987.
  4. When the checkpoint starts, the name-node sends the secondary a CheckpointSignature, which contains all
    important information about the name space: layout version, namespaceID, creation time, and length of the edits log file.
    The information is verified by both nodes, and serves as a confirmation that a) the image is merged correctly, and
    b) the name-node is receiving the right image when it downloads it back from the secondary.
    We can later extend this class with fsimage CRCs.
  5. ClientProtocol is changed as a result of that: rollEditLog() returns the signature instead of just the length of the edits file.
  6. The primary and secondary nodes used to have 2 separate servlets for transferring images.
    The problem was that the primary and the secondary were the same web-application in the servlet container.
    I registered the secondary under "webapps/secondary" and unified the GetImageServlet so that it is used for both nodes.
  7. All logic related to the image-transferring servlet is moved from the NameNode class directly to FSImage.
  8. I fixed the bug described in HADOOP-3069.
  9. TestCheckpoint is substantially modified to cover all new cases of failure.
    In particular, I included the test case for HADOOP-3069, which fails with the old code and passes with the new one.
  10. StringUtils is extended with a getStringCollection(String str) method, which splits a comma-delimited string
    and returns a collection rather than an array of Strings as getStrings() does.
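
The helper in item 10 can be pictured as follows. This is a minimal sketch, assuming plain comma splitting without trimming; the class StringCollectionSketch is a hypothetical stand-in, and the code committed to StringUtils may behave differently.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.StringTokenizer;

// Hypothetical stand-in for the StringUtils addition, for illustration only.
final class StringCollectionSketch {

    // Splits a comma-delimited string into a collection of values.
    // Returns an empty collection for null input; tokens are not trimmed.
    static Collection<String> getStringCollection(String str) {
        Collection<String> values = new ArrayList<>();
        if (str == null) {
            return values;
        }
        StringTokenizer tokenizer = new StringTokenizer(str, ",");
        while (tokenizer.hasMoreTokens()) {
            values.add(tokenizer.nextToken());
        }
        return values;
    }
}
```

Returning a collection lets callers iterate over a multi-valued setting such as "dfs.name.dir" without a null check on an array.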

Konstantin Shvachko made changes - 02/Apr/08 08:19 PM
Field Original Value New Value
Attachment SecondaryStorage.patch [ 12379179 ]
Konstantin Shvachko made changes - 02/Apr/08 08:20 PM
Link This issue incorporates HADOOP-3069 [ HADOOP-3069 ]
Konstantin Shvachko made changes - 02/Apr/08 08:20 PM
Link This issue incorporates HADOOP-2987 [ HADOOP-2987 ]
Konstantin Shvachko made changes - 02/Apr/08 08:25 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Affects Version/s 0.16.0 [ 12312740 ]
Affects Version/s 0.15.0 [ 12312565 ]
Assignee Konstantin Shvachko [ shv ]
Hadoop QA added a comment - 02/Apr/08 10:23 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12379179/SecondaryStorage.patch
against trunk revision 643282.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 3 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs -1. The patch appears to introduce 3 new Findbugs warnings.

core tests -1. The patch failed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2127/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2127/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2127/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2127/console

This message is automatically generated.


Konstantin Shvachko made changes - 09/Apr/08 01:39 AM
Status Patch Available [ 10002 ] Open [ 1 ]
Konstantin Shvachko added a comment - 11/Apr/08 02:46 AM - edited
This is a new patch that
  1. does not contain the code that was fixed in HADOOP-3069;
  2. fixes the Findbugs warnings from the previous run;
  3. fixes the TestCheckpoint failure.

The last one was tricky. TestCheckpoint failed on Hudson but not on any of the other machines I tested it on. The failure occurs because the name-node, when started, did not get an exclusive lock on its storage directory as required. I initially suspected a Solaris problem, but later realized that it is an NFS problem: NFS may not support exclusive locks consistently.
IMO we should enforce exclusivity of locks only if it is supported by a local file system.
So the test now checks whether exclusive locks are supported before failing.


Konstantin Shvachko made changes - 11/Apr/08 02:46 AM
Attachment SecondaryStorage.patch [ 12379900 ]
Konstantin Shvachko made changes - 11/Apr/08 03:03 AM
Fix Version/s 0.18.0 [ 12312972 ]
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 11/Apr/08 04:24 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12379900/SecondaryStorage.patch
against trunk revision 645773.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 3 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests -1. The patch failed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2202/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2202/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2202/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2202/console

This message is automatically generated.


dhruba borthakur added a comment - 11/Apr/08 06:51 AM
Code looks good. Only one comment:
1. The patch makes rollEditLog() return a WritableComparable. It might be better to make it return an object of type CheckpointSignature.

Konstantin Shvachko added a comment - 11/Apr/08 10:31 AM
rollEditLog() now returns CheckpointSignature instead of WritableComparable. I had to factor out CheckpointSignature into a separate file.
Fixed the unit test failure.
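
As a rough illustration of the type under discussion, a signature carrying the four fields named in the feature list (layout version, namespaceID, creation time, length of the edits log file) might look like the sketch below. SignatureSketch and matches() are hypothetical names; the committed CheckpointSignature may differ in its fields, its serialization, and the way it reports a mismatch.

```java
// Hypothetical illustration of a checkpoint signature, not the committed class.
final class SignatureSketch {
    final int layoutVersion;
    final int namespaceID;
    final long cTime;        // creation time of the name space
    final long editsLength;  // length of the edits log file at roll time

    SignatureSketch(int layoutVersion, int namespaceID, long cTime, long editsLength) {
        this.layoutVersion = layoutVersion;
        this.namespaceID = namespaceID;
        this.cTime = cTime;
        this.editsLength = editsLength;
    }

    // True only if every field agrees, i.e. both nodes refer to the same checkpoint.
    boolean matches(SignatureSketch other) {
        return other != null
            && layoutVersion == other.layoutVersion
            && namespaceID == other.namespaceID
            && cTime == other.cTime
            && editsLength == other.editsLength;
    }

    @Override
    public String toString() {
        return layoutVersion + ":" + namespaceID + ":" + cTime + ":" + editsLength;
    }
}
```

Returning a dedicated type from rollEditLog() rather than a general WritableComparable lets the caller compare signatures without casting.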

Konstantin Shvachko made changes - 11/Apr/08 10:31 AM
Attachment SecondaryStorage.patch [ 12379913 ]
Konstantin Shvachko made changes - 11/Apr/08 10:32 AM
Status Patch Available [ 10002 ] Open [ 1 ]
Konstantin Shvachko made changes - 11/Apr/08 10:32 AM
Hadoop Flags [Reviewed]
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 11/Apr/08 12:03 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12379913/SecondaryStorage.patch
against trunk revision 645773.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 3 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2207/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2207/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2207/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2207/console

This message is automatically generated.


Repository Revision Date User Message
ASF #647313 Fri Apr 11 21:18:29 UTC 2008 shv HADOOP-2585. Name-node imports namespace data from a recent checkpoint accessible via a NFS mount. Contributed by Konstantin Shvachko.
Files Changed
MODIFY /hadoop/core/trunk/src/test/org/apache/hadoop/dfs/TestCheckpoint.java
ADD /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/CheckpointSignature.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/FSNamesystem.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/FSImage.java
MODIFY /hadoop/core/trunk/build.xml
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/FSEditLog.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/ClientProtocol.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/TransferFsImage.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/FSConstants.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/io/WritableComparable.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/SecondaryNameNode.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/GetImageServlet.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/FSDirectory.java
MODIFY /hadoop/core/trunk/CHANGES.txt
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/util/StringUtils.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/Storage.java
MODIFY /hadoop/core/trunk/conf/hadoop-default.xml
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/conf/Configuration.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/NameNode.java

Konstantin Shvachko added a comment - 11/Apr/08 09:19 PM
I just committed this.

Konstantin Shvachko made changes - 11/Apr/08 09:19 PM
Status Patch Available [ 10002 ] Resolved [ 5 ]
Resolution Fixed [ 1 ]
Repository Revision Date User Message
ASF #647338 Fri Apr 11 22:33:26 UTC 2008 shv Increment ClientProtocol.versionID missed by HADOOP-2585. Contributed by Konstantin Shvachko.
Files Changed
MODIFY /hadoop/core/trunk/CHANGES.txt
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/ClientProtocol.java

Hudson added a comment - 12/Apr/08 01:10 PM

Robert Chansler made changes - 22/Jul/08 05:28 PM
Release Note Improved management of replicas of the name space image. If all replicas on the Name Node are lost, the latest check point can be loaded from the secondary Name Node. Use parameter "-importCheckpoint" and specify the location with "fs.checkpoint.dir." The directory structure on the secondary Name Node has changed to match the primary Name Node.
Hadoop Flags [Reviewed] [Incompatible change, Reviewed]
Nigel Daley made changes - 22/Aug/08 07:50 PM
Status Resolved [ 5 ] Closed [ 6 ]
Owen O'Malley made changes - 08/Jul/09 04:42 PM
Component/s dfs [ 12310710 ]