Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-1496

TestStorageRestore is failing after HDFS-903 fix

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 0.22.0, 0.23.0
    • Fix Version/s: 0.22.0
    • Component/s: test
    • Labels:
      None
    • Tags:
      regression

      Description

      TestStorageRestore seems to be failing after HDFS-903 commit. Running git bisect confirms it.

      1. HDFS-1496.sh
        6 kB
        Konstantin Boudnik
      2. HDFS-1496.sh
        6 kB
        Konstantin Boudnik
      3. HDFS-1496.sh
        6 kB
        Konstantin Boudnik

        Issue Links

          Activity

          Hide
          shv Konstantin Shvachko added a comment -

          Fixed by HDFS-1602.

          Show
          shv Konstantin Shvachko added a comment - Fixed by HDFS-1602 .
          Hide
          boryas Boris Shkolnik added a comment -

          Looks like the problem is that when we are trying to restore a storage dir, we format it , which always saves the current in-memory state into a new fsimage. So instead we should restore a storage without saving the state and creating new fsimage. It will be copied there during the checkpoint anyway. I've attached the patch to HDFS-1602. Please look at it and comment.(patch is for trunk).

          Show
          boryas Boris Shkolnik added a comment - Looks like the problem is that when we are trying to restore a storage dir, we format it , which always saves the current in-memory state into a new fsimage. So instead we should restore a storage without saving the state and creating new fsimage. It will be copied there during the checkpoint anyway. I've attached the patch to HDFS-1602 . Please look at it and comment.(patch is for trunk).
          Hide
          hairong Hairong Kuang added a comment -

          As I said before, I could not figure out a good way to fix it. All I could do is to disable the feature introduced by HADOOP-4885.

          Show
          hairong Hairong Kuang added a comment - As I said before, I could not figure out a good way to fix it. All I could do is to disable the feature introduced by HADOOP-4885 .
          Hide
          shv Konstantin Shvachko added a comment -

          How do we fix it?

          Show
          shv Konstantin Shvachko added a comment - How do we fix it?
          Hide
          hairong Hairong Kuang added a comment -

          Here is the inconsistency that is introduced by HADOOP-4885. After failed directories are added back, each failed directory contains new image + zero edit while each old good directory contains old image + old edit. If secondary namenode happens to fetch the image from a failed directory but fetch the edit from an old good directory, how would checkpointing work.

          Show
          hairong Hairong Kuang added a comment - Here is the inconsistency that is introduced by HADOOP-4885 . After failed directories are added back, each failed directory contains new image + zero edit while each old good directory contains old image + old edit. If secondary namenode happens to fetch the image from a failed directory but fetch the edit from an old good directory, how would checkpointing work.
          Hide
          cos Konstantin Boudnik added a comment -

          Hairong, what I am seen on a real (0.20.2 based cluster) the NN storage volume which has been once removed (e.g. because of a faulty NFS mount or something) is emptied as soon SNN starts checkpoint process. This happens because FSEditLog.synchronized void rollEditLog calls FSImage.attemptRestoreRemovedStorage and effectively formats a faulty volume if it becomes available.

          I guess it is possible that a checkpoint can happen before rollEditLog was called and than the inconsistency you've mentioned might be introduced. I think it won't happen because SecondaryNameNode.doMerge iterates through Storage.storageDirs which won't contain failed volume unless it has been restored and formatted. If this all is true then we have a test which is failing not because the feature doesn't work but rather because the test needs to be changed in lights of HDFS-903.

          Please let me know if my analysis is incorrect.

          Show
          cos Konstantin Boudnik added a comment - Hairong, what I am seen on a real (0.20.2 based cluster) the NN storage volume which has been once removed (e.g. because of a faulty NFS mount or something) is emptied as soon SNN starts checkpoint process. This happens because FSEditLog.synchronized void rollEditLog calls FSImage.attemptRestoreRemovedStorage and effectively formats a faulty volume if it becomes available. I guess it is possible that a checkpoint can happen before rollEditLog was called and than the inconsistency you've mentioned might be introduced. I think it won't happen because SecondaryNameNode.doMerge iterates through Storage.storageDirs which won't contain failed volume unless it has been restored and formatted. If this all is true then we have a test which is failing not because the feature doesn't work but rather because the test needs to be changed in lights of HDFS-903 . Please let me know if my analysis is incorrect.
          Hide
          cos Konstantin Boudnik added a comment -

          By adding new configuration parameter dfs.name.dir.restore I can see that HADOOP-4885 works on 0.20.2 based clusters where HDFS-903 hasn't been applied.

          Which still leaves the question of failing test in the trunk

          Show
          cos Konstantin Boudnik added a comment - By adding new configuration parameter dfs.name.dir.restore I can see that HADOOP-4885 works on 0.20.2 based clusters where HDFS-903 hasn't been applied. Which still leaves the question of failing test in the trunk
          Hide
          cos Konstantin Boudnik added a comment -

          fixing missed extra NN ops after NFS mount is restored.

          Show
          cos Konstantin Boudnik added a comment - fixing missed extra NN ops after NFS mount is restored.
          Hide
          cos Konstantin Boudnik added a comment -

          This system level test reproduces the same issue with local NFS server and over-NFS mounted storage volumes

          Show
          cos Konstantin Boudnik added a comment - This system level test reproduces the same issue with local NFS server and over-NFS mounted storage volumes
          Hide
          cos Konstantin Boudnik added a comment -

          Oh, sorry - I've misread you. +1 on disabling the feature because it isn't fully compatible with existing semantic expressed by functional tests.

          Show
          cos Konstantin Boudnik added a comment - Oh, sorry - I've misread you. +1 on disabling the feature because it isn't fully compatible with existing semantic expressed by functional tests.
          Hide
          hairong Hairong Kuang added a comment -

          I did not say disabling the test. I meant disabling the feature.

          Show
          hairong Hairong Kuang added a comment - I did not say disabling the test. I meant disabling the feature.
          Hide
          cos Konstantin Boudnik added a comment -

          I think it is either software bug or incorrect test. If the test ir wrong we can disable it for now and fix later. If we are seeing an actually software problem then disabling the test won't do any good, IMO.

          Show
          cos Konstantin Boudnik added a comment - I think it is either software bug or incorrect test. If the test ir wrong we can disable it for now and fix later. If we are seeing an actually software problem then disabling the test won't do any good, IMO.
          Hide
          hairong Hairong Kuang added a comment -

          I do not think that the storage directory restoration scheme introduced in HADOOP-4885 works well because it introduces inconsistent states among fsimage/edits directories. Each old good directory contains old image + old edit, but each restored directory contains new image with an empty edit log. This has the potential to corrupt fsimage if secondary NN happens to download the empty edit log from a newly restored edit log directory.

          I could not figure out a better way to fix this problem. Is it OK that I disable this feature for now so that unit test could pass? Good that Dhruba already enhanced saveNameSpace in HDFS-1509 that could be used as an alternative to restore the failed image directories.

          Show
          hairong Hairong Kuang added a comment - I do not think that the storage directory restoration scheme introduced in HADOOP-4885 works well because it introduces inconsistent states among fsimage/edits directories. Each old good directory contains old image + old edit, but each restored directory contains new image with an empty edit log. This has the potential to corrupt fsimage if secondary NN happens to download the empty edit log from a newly restored edit log directory. I could not figure out a better way to fix this problem. Is it OK that I disable this feature for now so that unit test could pass? Good that Dhruba already enhanced saveNameSpace in HDFS-1509 that could be used as an alternative to restore the failed image directories.
          Hide
          nidaley Nigel Daley added a comment -

          Causing test failure. Marking as a blocker for 0.22.

          Show
          nidaley Nigel Daley added a comment - Causing test failure. Marking as a blocker for 0.22.
          Hide
          hairong Hairong Kuang added a comment -

          This turns out to be a bug in storage directory restoration. Image validation exposes the error.

          Currently NN uses rollFSEdits to trigger storage directory recovery. The recovery may trigger a saving of the namespace to the newly restored directory which as a result changes in memory image digest. However later on image & edits were fetched from an old storage directory, thus causing the checksum mismatch.

          The problem with this storage restoration scheme is that it makes the on-disk state of all storage directories inconsistent.

          Show
          hairong Hairong Kuang added a comment - This turns out to be a bug in storage directory restoration. Image validation exposes the error. Currently NN uses rollFSEdits to trigger storage directory recovery. The recovery may trigger a saving of the namespace to the newly restored directory which as a result changes in memory image digest. However later on image & edits were fetched from an old storage directory, thus causing the checksum mismatch. The problem with this storage restoration scheme is that it makes the on-disk state of all storage directories inconsistent.
          Hide
          cos Konstantin Boudnik added a comment -

          According to git bisect

          23f1ecd155ba2cb6b22eed42541cad1d1bff329a is the first bad commit
          commit 23f1ecd155ba2cb6b22eed42541cad1d1bff329a
          Author: Hairong Kuang <hairong@apache.org>
          Date:   Mon Nov 8 06:49:32 2010 +0000
          
              HDFS-903.  Support fsimage validation using MD5 checksum. Contributed by Hairong Kuang.
          
          
              git-svn-id: https://svn.apache.org/repos/asf/hadoop/hdfs/trunk@1032470 13f79535-47bb-0310-9956-ffa450edef68
          
          :100644 100644 602f09709e3d09a62d5a77cfbd010d9c476b77c7 506f045ff0c05b10140c890b06f4e8d8405491fd M      CHANGES.txt
          :040000 040000 d31ddaa937084215931b28da37a4a3e81e9ec487 4cbbdb1597c2c66ce9da5e2a1dccdd5daa45ab6d M      src
          

          From offline conversation with Hairong it seems totally possible.

          Show
          cos Konstantin Boudnik added a comment - According to git bisect 23f1ecd155ba2cb6b22eed42541cad1d1bff329a is the first bad commit commit 23f1ecd155ba2cb6b22eed42541cad1d1bff329a Author: Hairong Kuang <hairong@apache.org> Date: Mon Nov 8 06:49:32 2010 +0000 HDFS-903. Support fsimage validation using MD5 checksum. Contributed by Hairong Kuang. git-svn-id: https://svn.apache.org/repos/asf/hadoop/hdfs/trunk@1032470 13f79535-47bb-0310-9956-ffa450edef68 :100644 100644 602f09709e3d09a62d5a77cfbd010d9c476b77c7 506f045ff0c05b10140c890b06f4e8d8405491fd M CHANGES.txt :040000 040000 d31ddaa937084215931b28da37a4a3e81e9ec487 4cbbdb1597c2c66ce9da5e2a1dccdd5daa45ab6d M src From offline conversation with Hairong it seems totally possible.

            People

            • Assignee:
              hairong Hairong Kuang
              Reporter:
              cos Konstantin Boudnik
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development