Hadoop Common / HADOOP-2585

Automatic namespace recovery from the secondary image.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.16.0
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Improved management of replicas of the name space image. If all replicas on the Name Node are lost, the latest checkpoint can be loaded from the secondary Name Node. Use parameter "-importCheckpoint" and specify the location with "fs.checkpoint.dir". The directory structure on the secondary Name Node has changed to match the primary Name Node.

      Description

      Hadoop has a three-way (configuration-controlled) protection from losing the namespace image.

      1. image can be replicated on different hard-drives of the same node;
      2. image can be replicated on a nfs mounted drive on an independent node;
      3. a stale replica of the image is created during periodic checkpointing and stored on the secondary name-node.
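The first two protections are configured by listing several directories (local and NFS-mounted) in the name-node configuration; a hadoop-site.xml fragment along these lines, with illustrative paths, would look like:

```xml
<!-- Paths are illustrative; dfs.name.dir takes a comma-separated list,
     typically mixing local disks and an NFS mount on another node. -->
<property>
  <name>dfs.name.dir</name>
  <value>/disk1/dfs/name,/disk2/dfs/name,/mnt/nfs/dfs/name</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/disk1/dfs/namesecondary</value>
</property>
```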

      Currently during startup the name-node examines all configured storage directories, selects the
      most up-to-date image, reads it, merges it with the corresponding edits, and writes the new image back
      into all storage directories. Everything is done automatically.

      If, due to multiple hardware failures, none of the images on mounted hard drives (local or remote)
      is available, the secondary image, although stale (up to one hour old by default), can still be
      used to recover the majority of the file system data.
      Currently one can reconstruct a valid name-node image from the secondary one manually.
      It would be nice to support automatic recovery.

      1. SecondaryStorage.patch
        70 kB
        Konstantin Shvachko
      2. SecondaryStorage.patch
        68 kB
        Konstantin Shvachko
      3. SecondaryStorage.patch
        68 kB
        Konstantin Shvachko

        Issue Links

          Activity

          Konstantin Shvachko created issue -
          Konstantin Shvachko added a comment -

          We had a real example of such a failure on one of our clusters.
          We were able to reconstruct the namespace image from the secondary node using the following
          manual procedure, which might be useful for those who find themselves in the same type of trouble.

          Manual recovery procedure from the secondary image.

          1. Stop the cluster to make sure all data-nodes and *-trackers are down.
          2. Select a node where you will run the new name-node, and set it up as usual for a name-node.
          3. Format the new name-node.
          4. cd <dfs.name.dir>/current
          5. You will see a file VERSION there. You will need to set the namespaceID of the old cluster in it.
            The old namespaceID can be obtained from one of the data-nodes:
            just copy the namespaceID value from <dfs.data.dir>/current/VERSION.
          6. rm <dfs.name.dir>/current/fsimage
          7. scp <secondary-node>:<fs.checkpoint.dir>/destimage.tmp ./fsimage
          8. Start the cluster. Upgrade is recommended, so that you can roll back if something goes wrong.
          9. Run fsck, and remove files with missing blocks if any.

          Automatic recovery proposal.

          The proposal consists of 2 parts.

          1. The secondary node should store the latest check-pointed image file in compliance with the
            name-node storage directory structure. It is best if the secondary node uses the Storage class
            (or FSImage if code re-use makes sense here) in order to maintain the checkpoint directory.
            This should ensure that the checkpointed image is always ready to be read by a name-node
            if the directory is listed in its "dfs.name.dir" list.
          2. The name-node should consider the configuration variable "fs.checkpoint.dir" as a possible
            location of an image available for read-only access during startup.
            This means that if the name-node finds all directories listed in "dfs.name.dir" unavailable or
            finds their images corrupted, then it should turn to the "fs.checkpoint.dir" directory
            and try to fetch the image from there. I think this should not be the default behavior but
            rather be triggered by a name-node startup option, something like:
            hadoop namenode -fromCheckpoint
            

            So the name-node can start with the secondary image as long as the secondary node drive is mounted.
            And the name-node will never attempt to write anything to this drive.
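            That fallback order can be sketched as follows (class, method, and layout names are illustrative, not Hadoop's; note the proposal says -fromCheckpoint while the committed patch named the option -importCheckpoint):

```java
import java.io.File;
import java.util.List;

public class ImageLocator {
  // Prefer a readable image in one of the dfs.name.dir directories; only
  // when all of them are missing or empty AND recovery was explicitly
  // requested, fall back to the fs.checkpoint.dir image, which is then
  // treated strictly as read-only.
  static File locateImage(List<File> nameDirs, File checkpointDir,
                          boolean importCheckpoint) {
    for (File dir : nameDirs) {
      File img = new File(dir, "current/fsimage");
      if (img.isFile() && img.length() > 0) {
        return img;                       // normal startup path
      }
    }
    if (importCheckpoint) {
      File img = new File(checkpointDir, "current/fsimage");
      if (img.isFile()) {
        return img;                       // recovery path: never written to
      }
    }
    throw new IllegalStateException("no usable image found");
  }
}
```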

          Added bonuses provided by this approach

          • One can choose to restart failed name-node directly on the node where the secondary node ran.
            This brings us a step closer to the hot standby.
          • Replication of the image to NFS can be delegated to the secondary name-node if we
            support multiple entries in "fs.checkpoint.dir". This of course applies only if the administrator
            chooses to accept outdated images in order to boost the name-node performance.
          Konstantin Shvachko added a comment -

          The main features in this patch.

          1. A "-importCheckpoint" argument is introduced for the name-node startup. Started with this argument:
            • The name-node will try to load the image from the directory specified in "fs.checkpoint.dir" and save it to
              the name-node storage directory(s) set in "dfs.name.dir";
            • The name-node will fail if a legal image is contained in "dfs.name.dir";
            • The name-node verifies that the image in "fs.checkpoint.dir" is consistent, but does not modify it in any way.
          2. The secondary node storage directory structure is standardized to match the primary node directory structure.
            As a consequence we get protection
            • from accidentally starting multiple secondary nodes in the same directory;
            • from starting the primary and the secondary in the same directory.
            • The primary can be started directly with the checkpointed storage if necessary.
          3. The checkpoint directory now contains the 2 latest images: "current" has the most recent image, and
            "previous.checkpoint" has the previous one, which (as I understand it) was proposed in HADOOP-2987.
          4. When the checkpoint starts, the name-node sends a CheckpointSignature to the secondary, which contains all
            important information about the name space: layout version, namespaceID, creation time, and length of the edits log file.
            The information is verified by both nodes, and serves as a confirmation that a) the image is merged correctly, and
            b) the name-node is receiving the right image when it downloads it back from the secondary.
            We can later extend this class with fsimage CRCs.
          5. ClientProtocol is changed as a result of that. rollEditLog() returns the signature instead of just the length of the edits file.
          6. The primary and secondary nodes used to have 2 separate servlets for transferring images.
            The problem was that the primary and the secondary shared the same web-application in the servlet container.
            I registered the secondary under "webapps/secondary" and unified the GetImageServlet so that it is used for both nodes.
          7. All logic related to the image transfer servlet was moved from the NameNode class directly to FSImage.
          8. I fixed the bug described in HADOOP-3069.
          9. TestCheckpoint is substantially modified to cover all new cases of failure.
            Particularly, I included the test case for HADOOP-3069, which fails with old code and does not with the new one.
          10. StringUtils is extended with a getStringCollection(String str) method, which splits a comma-delimited string
            and returns a collection rather than an array of Strings as getStrings() does.
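          A standalone re-implementation of the behavior described for getStringCollection (this is a sketch, not Hadoop's code, which may differ in details such as trimming):

```java
import java.util.ArrayList;
import java.util.Collection;

public class StringSplit {
  // Split a comma-delimited string and return a Collection<String>
  // rather than a String[]; null and empty fields yield no entries.
  static Collection<String> getStringCollection(String str) {
    Collection<String> values = new ArrayList<String>();
    if (str == null) {
      return values;
    }
    for (String field : str.split(",")) {
      if (!field.isEmpty()) {
        values.add(field);
      }
    }
    return values;
  }
}
```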
          Konstantin Shvachko made changes -
          Field Original Value New Value
          Attachment SecondaryStorage.patch [ 12379179 ]
          Konstantin Shvachko made changes -
          Link This issue incorporates HADOOP-3069 [ HADOOP-3069 ]
          Konstantin Shvachko made changes -
          Link This issue incorporates HADOOP-2987 [ HADOOP-2987 ]
          Konstantin Shvachko made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Affects Version/s 0.16.0 [ 12312740 ]
          Affects Version/s 0.15.0 [ 12312565 ]
          Assignee Konstantin Shvachko [ shv ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12379179/SecondaryStorage.patch
          against trunk revision 643282.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to introduce 3 new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2127/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2127/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2127/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2127/console

          This message is automatically generated.

          Konstantin Shvachko made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Konstantin Shvachko added a comment - edited

          This is a new patch that

          1. does not contain code that was fixed in HADOOP-3069;
          2. fixes findbugs from the previous run;
          3. fixes TestCheckpoint failure.

          The latter was tricky. TestCheckpoint failed on Hudson but not on any of the other machines I tested it on. The failure is related to the fact that when the name-node started it did not get an exclusive lock for its storage directory as required. I initially suspected that this was a Solaris problem, but later realized that it is an NFS problem: NFS may not support exclusive locks consistently.
          IMO we should enforce exclusivity of locks only if it is supported by the local file system.
          So the test now checks whether exclusive locks are supported before failing.
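          A sketch of how such a probe can work, assuming the storage lock is taken with java.nio's FileChannel.tryLock (names here are illustrative, not the patch's actual code):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class LockProbe {
  // Probe whether the file system under `dir` supports exclusive file
  // locks: try to take tryLock() on a scratch file and see whether a
  // usable lock comes back. On some NFS mounts tryLock throws or the
  // lock is not actually exclusive, so callers can skip lock assertions.
  static boolean supportsExclusiveLock(File dir) {
    File probe = new File(dir, "probe.lock");
    try (RandomAccessFile raf = new RandomAccessFile(probe, "rws");
         FileChannel channel = raf.getChannel()) {
      FileLock lock = channel.tryLock();
      if (lock == null) {
        return false;        // held by another process, or unsupported
      }
      lock.release();
      return true;
    } catch (Exception e) {
      return false;          // e.g. NFS without consistent lock support
    } finally {
      probe.delete();
    }
  }
}
```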

          Konstantin Shvachko made changes -
          Attachment SecondaryStorage.patch [ 12379900 ]
          Konstantin Shvachko made changes -
          Fix Version/s 0.18.0 [ 12312972 ]
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12379900/SecondaryStorage.patch
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2202/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2202/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2202/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2202/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          Code looks good. Only one comment:
          1. The patch makes rollEditLog() return a WritableComparable. It might be better to make it return an object of type CheckpointSignature.

          Konstantin Shvachko added a comment -

          rollEditLog() now returns CheckpointSignature instead of WritableComparable. I had to factor out CheckpointSignature into a separate file.
          Fixed unit test failure.

          Konstantin Shvachko made changes -
          Attachment SecondaryStorage.patch [ 12379913 ]
          Konstantin Shvachko made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Konstantin Shvachko made changes -
          Hadoop Flags [Reviewed]
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12379913/SecondaryStorage.patch
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2207/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2207/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2207/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2207/console

          This message is automatically generated.

          Konstantin Shvachko added a comment -

          I just committed this.

          Konstantin Shvachko made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hudson added a comment -

          Integrated in Hadoop-trunk #458 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/458/ )
          Robert Chansler made changes -
          Release Note Improved management of replicas of the name space image. If all replicas on the Name Node are lost, the latest check point can be loaded from the secondary Name Node. Use parameter "-importCheckpoint" and specify the location with "fs.checkpoint.dir." The directory structure on the secondary Name Node has changed to match the primary Name Node.
          Hadoop Flags [Reviewed] [Incompatible change, Reviewed]
          Nigel Daley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Owen O'Malley made changes -
          Component/s dfs [ 12310710 ]

            People

            • Assignee:
              Konstantin Shvachko
              Reporter:
              Konstantin Shvachko
            • Votes:
              0
              Watchers:
              2