Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-1481

NameNode should validate fsimage before rolling

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.23.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We had an incident that the fsimage at secondary NameNode was truncated but got uploaded to the primary NameNode. The primary NameNode simply rolled the image without checking its integrity, therefore causing the fsimage to corrupt. The primary NameNode should check the new image's integrity before rolling fsimage.

      1. trunkValidateUpload.patch
        7 kB
        Hairong Kuang
      2. trunkValidateUpload1.patch
        7 kB
        Hairong Kuang

        Issue Links

          Activity

          Hide
          Hairong Kuang added a comment -

          This patch calculates checksum when NameNode download image from Secondary or Checkpointing NameNode. If the calculated checksum does not match the one supposed to be, it fails the image transfer.

          Show
          Hairong Kuang added a comment - This patch calculates checksum when NameNode download image from Secondary or Checkpointing NameNode. If the calculated checksum does not match the one supposed to be, it fails the image transfer.
          Hide
          Konstantin Shvachko added a comment -

          The patch looks good.
          It verifies the checksum, when NN uploads the checkpoint image.
          Looking at the code it seems that we can also verify the checksum when CN loads the image from NN, when the checkpoint starts. It seems as simple as calling TransferFsImage.getFileClient() with getChecksum = true instead of false in downloadCheckpoint(), since the signature already has imageDigest.
          For SNN it doesn't look as simple.

          Show
          Konstantin Shvachko added a comment - The patch looks good. It verifies the checksum, when NN uploads the checkpoint image. Looking at the code it seems that we can also verify the checksum when CN loads the image from NN, when the checkpoint starts. It seems as simple as calling TransferFsImage.getFileClient() with getChecksum = true instead of false in downloadCheckpoint(), since the signature already has imageDigest. For SNN it doesn't look as simple.
          Hide
          Hairong Kuang added a comment -

          Konstantin, thanks for your review. Sorry for getting back to this late. I was distracted by something else.

          For CN, image checksum validate was done when loading the image from disk to memory. So I do not think CN should validate it at download time.

          Show
          Hairong Kuang added a comment - Konstantin, thanks for your review. Sorry for getting back to this late. I was distracted by something else. For CN, image checksum validate was done when loading the image from disk to memory. So I do not think CN should validate it at download time.
          Hide
          Konstantin Shvachko added a comment -

          Looks like you are right. The new image checksum is verified against the one in the signature after the image is loaded from disk. So my suggestion would be an optimization for failing faster, but it is optional.
          +1 the patch looks good. It would be good to remove the unused import of InetAddress in Checkpointer.

          Show
          Konstantin Shvachko added a comment - Looks like you are right. The new image checksum is verified against the one in the signature after the image is loaded from disk. So my suggestion would be an optimization for failing faster, but it is optional. +1 the patch looks good. It would be good to remove the unused import of InetAddress in Checkpointer.
          Hide
          Hairong Kuang added a comment -

          This patch removed the unnecessary import in CheckPointer.

          Show
          Hairong Kuang added a comment - This patch removed the unnecessary import in CheckPointer.
          Hide
          Hairong Kuang added a comment -

          ant test-patch results:
          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] -1 tests included. The patch doesn't appe
          [exec] ar to include any new or modified tests.
          [exec] Please justify why no new tests are needed for this patch.
          [exec] Also please list what manual steps were performed to verify this patch.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec]
          [exec] +1 system test framework. The patch passed system test framework compile.

          This change is simple but too tricky to write a unit test. I plan to commit it without a unit test.

          Ant test failed. But all failed ones are known:
          TestBlockRecovery
          TestHDFSTrash
          TestStorageRestore
          TestBalancer
          TestSaveNamespace

          Show
          Hairong Kuang added a comment - ant test-patch results: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] -1 tests included. The patch doesn't appe [exec] ar to include any new or modified tests. [exec] Please justify why no new tests are needed for this patch. [exec] Also please list what manual steps were performed to verify this patch. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] [exec] +1 system test framework. The patch passed system test framework compile. This change is simple but too tricky to write a unit test. I plan to commit it without a unit test. Ant test failed. But all failed ones are known: TestBlockRecovery TestHDFSTrash TestStorageRestore TestBalancer TestSaveNamespace
          Hide
          Hairong Kuang added a comment -

          I've committed this!

          Show
          Hairong Kuang added a comment - I've committed this!

            People

            • Assignee:
              Hairong Kuang
              Reporter:
              Hairong Kuang
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development