HDFS-903: NN should verify images and edit logs on startup

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Store fsimage MD5 checksum in VERSION file. Validate checksum when loading a fsimage. Layout version bumped.

      Description

      I was playing around with corrupting fsimage and edits logs when there are multiple dfs.name.dirs specified. I noticed that:

      • As long as your corruption does not make the image invalid (e.g. changing an opcode to an invalid opcode), HDFS doesn't notice and happily uses the corrupt image or applies the corrupt edit.
      • If the first image in dfs.name.dir is "valid", it replaces the copies in the other name.dirs with this first image, even if they differ. So if the first image actually contains invalid/old/corrupt metadata, you've lost your valid metadata, which can result in data loss if the namenode garbage collects blocks that it thinks are no longer used.

      How about we maintain a checksum as part of the image and edit log, check it on startup, and refuse to start if it doesn't match? Or at least provide a configuration option to do so, if people are worried about the overhead of maintaining checksums for these files. Even if we assume dfs.name.dir is reliable storage, this guards against operator error.
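
      The proposal can be sketched end-to-end as below. This is an illustrative standalone sketch, not HDFS code: the `imageMD5Digest` property name, the file layout, and the method names are assumptions for the example.

      ```java
      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.security.DigestOutputStream;
      import java.security.MessageDigest;

      public class ImageChecksum {
          static String hex(byte[] d) {
              StringBuilder sb = new StringBuilder();
              for (byte b : d) sb.append(String.format("%02x", b));
              return sb.toString();
          }

          // Compute the MD5 while writing the image, then persist the hex
          // digest in a VERSION-style sidecar file next to it.
          static void saveImage(Path image, Path version, byte[] contents) throws Exception {
              MessageDigest md5 = MessageDigest.getInstance("MD5");
              try (DigestOutputStream out =
                       new DigestOutputStream(Files.newOutputStream(image), md5)) {
                  out.write(contents);
              }
              Files.write(version, ("imageMD5Digest=" + hex(md5.digest())).getBytes());
          }

          // On startup, recompute the digest and refuse to load on mismatch.
          static byte[] loadImage(Path image, Path version) throws Exception {
              String stored = new String(Files.readAllBytes(version)).trim()
                                  .replace("imageMD5Digest=", "");
              byte[] contents = Files.readAllBytes(image);
              String actual = hex(MessageDigest.getInstance("MD5").digest(contents));
              if (!actual.equals(stored))
                  throw new IOException("fsimage checksum mismatch: " + actual + " != " + stored);
              return contents;
          }

          public static void main(String[] args) throws Exception {
              Path dir = Files.createTempDirectory("nn");
              Path img = dir.resolve("fsimage"), ver = dir.resolve("VERSION");
              saveImage(img, ver, "abc".getBytes());
              // An intact image round-trips; a flipped byte would throw above.
              System.out.println(new String(loadImage(img, ver)));
          }
      }
      ```

      A corrupted image (or a stale copy in another name.dir) would fail the digest comparison instead of being silently loaded.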

      1. trunkChecksumImage.patch
        19 kB
        Hairong Kuang
      2. trunkChecksumImage1.patch
        23 kB
        Hairong Kuang
      3. trunkChecksumImage2.patch
        22 kB
        Hairong Kuang
      4. trunkChecksumImage3.patch
        24 kB
        Hairong Kuang
      5. trunkChecksumImage4.patch
        24 kB
        Hairong Kuang

        Issue Links

          Activity

          Eli Collins created issue -
          Allen Wittenauer added a comment -

          One of the discussions we had about 2 years ago was keeping minimum 3 versions of the fsimage and edits file so that we could do parity checking.

          Jeff Hammerbacher made changes -
          Field Original Value New Value
          Link This issue relates to HDFS-1382 [ HDFS-1382 ]
          Hairong Kuang made changes -
          Link This issue relates to HDFS-1458 [ HDFS-1458 ]
          Hairong Kuang added a comment -

          Any progress on this?

          If this is done, HDFS-1458 could check whether the primary NN has changed its image by simply fetching the primary NN's image checksum and comparing it to the 2nd NN's image checksum.

          Eli Collins added a comment -

          Hey Hairong,

          I haven't had a chance to work on this yet, feel free to grab it. Agree this would work well with HDFS-1458.

          Thanks,
          Eli

          Hairong Kuang made changes -
          Assignee Eli Collins [ eli ] Hairong Kuang [ hairong ]
          Hairong Kuang made changes -
          Fix Version/s 0.22.0 [ 12314241 ]
          Hairong Kuang added a comment -

          Here is the plan:
          1. Generate a MD5 Digest when saving an image;
          2. Store MD5 Digest hash in Version file;
          3. When loading an image, generate an MD5 Digest as well and then compare it to the one stored in the VERSION file.

          Hairong Kuang added a comment -

          Do we want to provide a configuration option for checksumming the image or not? I tend to say no. But does anybody have a use case where you would not want the image to be checksummed, other than the performance concern?

          Hairong Kuang added a comment -

          Thought more about this. Instead of storing the image's MD5 digest in the VERSION file, I am thinking of storing it at the end of the image file. The advantage is that the image is self-contained. We would not need to worry about atomicity when rolling images.

          Todd Lipcon added a comment -

          I think it's better to put it in the VERSION file, since then you can use the command-line "md5sum" utility to check for corruption.

          The atomicity issue is a little annoying, I agree. I can't think of a good short-term solution... HDFS-1073 will help with that, at least.

          Konstantin Shvachko added a comment -

          Hairong, we should also add the image MD5 digest into CheckpointSignature. I believe it will be easier to do that if the digest is in VERSION. Also until VERSION is written it does not matter what the image state is. If VERSION is written then the MD5 is there so everything is consistent.
          Would you consider adding MD5 into CheckpointSignature?

          Hairong Kuang added a comment -

          @Todd, you made a valid point.

          @Konstantin, sooo good to see that you are back! Yes, I will add MD5 into CheckpointSignature so that HDFS-1458 can happen.

          dhruba borthakur added a comment -

          I agree with Konstantin/Hairong that the MD5 signature should be part of the CheckpointSignature.

          It would have been nice if the contents of the VERSION file were stored as a header record at the beginning of the fsimage file itself (I now remember the initial reason why the VERSION file exists separate from the fsimage: the datanode needs the VERSION file too for its block directories, and the datanode does not have an fsimage file). Given that, it should be fine to store the checksum in the VERSION file. Also, the algorithm to compute the checksum need not be configurable; it could be hardcoded to generate an MD5 checksum.

          Allen Wittenauer added a comment -

          > I think it's better to put in VERSION file since then you can use a command line "md5sum" utility to check for corruption.

          +1

          This is much more operations friendly. If an alternative is picked (which is fine), just keep in mind we'll need a tool built to go with this change.

          Suresh Srinivas added a comment -

          > It would have been nice if the contents of the VERSION file was stored as a header record in the beginning of the fsimage file itself
          Currently VERSION creation signals the end of snapshot creation, independent of fsimage and edits creation. Moving VERSION into fsimage would complicate the current design.

          Hairong Kuang made changes -
          Link This issue is blocked by HADOOP-7009 [ HADOOP-7009 ]
          Hairong Kuang added a comment -

          Here comes the patch.

          1. compute the MD5 checksum when saving an image;
          2. an image's checksum is stored in memory and persisted on disk in VERSION;
          3. loading an image does checksum verification;
          4. between the primary & secondary NN, the checksum is transferred as part of CheckpointSignature.

          I also added a unit test for this feature.

          Hairong Kuang made changes -
          Attachment trunkChecksumImage.patch [ 12458117 ]
          Konstantin Shvachko added a comment -

          The image part of the patch looks good. I liked the seamless calculation of the checksum with DigestInputStream.
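
          The single-pass load-and-digest pattern praised here can be illustrated roughly as follows. This is a standalone sketch, not the patch's actual code; the class and method names are made up for the example.

          ```java
          import java.io.ByteArrayInputStream;
          import java.io.InputStream;
          import java.security.DigestInputStream;
          import java.security.MessageDigest;

          public class Md5OnTheFly {
              // Wrap the image stream in a DigestInputStream so the MD5
              // accumulates as the loader consumes the bytes: one pass over
              // the file, no separate checksumming read.
              static String readAndDigest(InputStream raw) throws Exception {
                  MessageDigest md5 = MessageDigest.getInstance("MD5");
                  try (DigestInputStream in = new DigestInputStream(raw, md5)) {
                      byte[] buf = new byte[8192];
                      while (in.read(buf) != -1) { /* the loader would parse image records here */ }
                  }
                  StringBuilder sb = new StringBuilder();
                  for (byte b : md5.digest()) sb.append(String.format("%02x", b));
                  return sb.toString();
              }

              public static void main(String[] args) throws Exception {
                  // MD5("abc") is the RFC 1321 test vector
                  System.out.println(readAndDigest(new ByteArrayInputStream("abc".getBytes("UTF-8"))));
                  // prints 900150983cd24fb0d6963f7d28e17f72
              }
          }
          ```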

          The checkpoint part is a bit inconsistent.
          The CheckpointSignature is an invariant of the current checkpoint process. This is the way for the NN to recognize that it is still talking with the same checkpointer that started the process. Therefore, we should never alter the signature within the same checkpoint process.
          We should probably make the JavaDoc for CheckpointSignature more explicit on that.

          Current implementation is inconsistent in two ways.
          On the one hand, rollImage for SNN sends back a new checkpoint signature with the digest of the new image. This value is recorded by the NN as the new digest. Instead

          1. The SNN should send back the digest of the original (old) image, and the NN should verify it against the original signature by calling validateStorageInfo(), as it is done in endCheckpoint().
          2. The NN should calculate the digest by itself not relying on the value passed by SNN.
            It could probably be done while uploading the image from SNN.

          On the other hand, the BN/CN sends back the original checkpoint signature with the old image digest. And the NN records it as the new digest, which should lead to an error during reloading. Again the NN should always calculate the digest by itself, it should be the only authority for its own image.

          A couple of nits:

          • At the end of FSImage.loadFSImage(File) the same LOG message is printed twice.
          • Empty line added in FSImage.resetVersion()
          Hairong Kuang added a comment -

          Konstantin, thank you so much for your review comments.

          > The NN should calculate the digest by itself not relying on the value passed by SNN. It could probably be done while uploading the image from SNN.
          As I pointed out in HDFS-1382, an image on disk or in transmission might get corrupted. So calculating the checksum while uploading the image from the SNN is not reliable at all. The NN should depend on the SNN to get the new image's checksum.

          > the NN records it as the new digest, which should lead to an error during reloading.
          I did not get this point. Why?

          Hairong Kuang added a comment -

          Oops, I got the wrong jira number. It should be "as I pointed out in HDFS-1481".

          Hairong Kuang made changes -
          Link This issue blocks HDFS-1481 [ HDFS-1481 ]
          Hairong Kuang added a comment -

          I forgot to mention that I agree with comment 1. I will change rollFsImage to have two parameters, one representing the old checkpoint signature and the other representing the new image signature.

          Hairong Kuang added a comment -

          This patch added the old checkpoint signature as a parameter to rollFsImage. Similarly I added the new image signature as a parameter to endCheckpoint. Konstantin, does it make sense?

          I also removed the redundant logging and the unnecessary blank line.

          Hairong Kuang made changes -
          Attachment trunkChecksumImage1.patch [ 12458579 ]
          Konstantin Shvachko added a comment -

          I didn't know you had already discussed it. I agree the image can be corrupted during transmission.
          It seems logical to include the verification logic in the transmission process. That is, the SNN sends the checksum via the servlet, then the NN uploads the image, calculates the checksum of the downloaded bytes on the fly, and matches it against the one sent by the SNN. The checksum verification can be done by validateCheckpointUpload(), which is already there and just needs to be extended.
          I don't think it would be a good idea to separate the upload and the verification, which is unavoidable if you first upload, then send the checksum via rollFSImage() and verify inside.
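
          The transfer-time verification described here might look roughly like the sketch below. The class, method name, and parameter handling are assumptions for illustration, not the actual servlet code: the sender supplies the expected checksum up front, and the receiver digests the stream while downloading and rejects the image on mismatch.

          ```java
          import java.io.ByteArrayInputStream;
          import java.io.IOException;
          import java.io.InputStream;
          import java.security.DigestInputStream;
          import java.security.MessageDigest;

          public class TransferVerify {
              static String hex(byte[] d) {
                  StringBuilder sb = new StringBuilder();
                  for (byte b : d) sb.append(String.format("%02x", b));
                  return sb.toString();
              }

              // Digest the uploaded bytes as they stream in, then compare
              // against the checksum the sender passed alongside the upload.
              static void receiveImage(InputStream upload, String expectedMd5) throws Exception {
                  MessageDigest md5 = MessageDigest.getInstance("MD5");
                  try (DigestInputStream in = new DigestInputStream(upload, md5)) {
                      byte[] buf = new byte[8192];
                      while (in.read(buf) != -1) { /* the NN would write bytes to fsimage.ckpt here */ }
                  }
                  String actual = hex(md5.digest());
                  if (!actual.equals(expectedMd5))
                      throw new IOException("image corrupted in transit: " + actual);
              }

              public static void main(String[] args) throws Exception {
                  byte[] image = "abc".getBytes("UTF-8");
                  // MD5("abc"), the RFC 1321 test vector
                  receiveImage(new ByteArrayInputStream(image),
                               "900150983cd24fb0d6963f7d28e17f72");
                  System.out.println("verified");
              }
          }
          ```

          Doing the check inside the transfer keeps upload and verification in one step, which is the point being made above.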

          Hairong Kuang added a comment -

          > It seems logical if we include the verification logic in the transmission process.
          This seems good to me. This was one of the options that I discussed with Dhruba for HDFS-1481.

          Probably in this jira, I will send the checksum together while uploading image. Then in HDFS-1481, I will add the verification part.

          Konstantin Shvachko added a comment -

          Sounds like a plan.

          Hairong Kuang added a comment -

          This patch addressed Konstantin's review comment.

          Hairong Kuang made changes -
          Attachment trunkChecksumImage2.patch [ 12458800 ]
          Konstantin Shvachko added a comment -

          Do you also need to set "&newChecksum=" in Checkpointer.uploadCheckpoint()? Should be the same as in SNN.putFSImage().

          Hairong Kuang added a comment -

          This patch made the change in Checkpointer as Konstantin suggested. Actually TestBackupNode caught this.

          The patch also fixed a subtle bug in TestSaveNameSpace caused by using spy. A spied object does only a shallow copy of the original object. So when a new checksum is generated while saving the image to disk, the new value is set in spyImage, but when saving the signature into the VERSION file using StorageDirectory, it uses the value set in originalImage, so reloading the image would fail. I fixed it by explicitly setting the storage directories in spyImage.

          Hairong Kuang made changes -
          Attachment trunkChecksumImage3.patch [ 12458933 ]
          Hairong Kuang added a comment -

          antPatch.sh passed except for:
          [exec] -1 release audit. The applied patch generated 97 release audit warnings (more than the trunk's current 1 warning).
          The release warnings are all about license headers, but my patch does not add any new files. I do think there is a bug in the script.

          All unit tests passed except for the known failures.

          Hairong Kuang added a comment -

          Fixed one more failed unit test.

          Hairong Kuang made changes -
          Attachment trunkChecksumImage4.patch [ 12459026 ]
          Konstantin Shvachko added a comment -

          +1 Looks good.

          Hairong Kuang added a comment -

          I've just committed this!

          Hairong Kuang made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags [Incompatible change, Reviewed]
          Release Note Store fsimage MD5 checksum in VERSION file. Validate checksum when loading a fsimage. Layout version bumped.
          Resolution Fixed [ 1 ]
          Konstantin Boudnik made changes -
          Link This issue is related to HDFS-1496 [ HDFS-1496 ]
          Todd Lipcon made changes -
          Link This issue breaks HDFS-1500 [ HDFS-1500 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to HDFS-1602 [ HDFS-1602 ]
          SreeHari added a comment -

          With this change, the Backupnode downloads the image & edits files from the namenode every time, since a difference in checkpoint time is always maintained between the Namenode and the Backupnode. This happens because the Namenode resets its checkpoint time every time: renewCheckpointTime is ignored and true is passed explicitly to rollFsImage during endCheckpoint. Isn't this a problem, or am I missing something?

          Konstantin Shvachko made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Konstantin Shvachko made changes -
          Link This issue contains HDFS-52 [ HDFS-52 ]
          Transitions:
          • Open → Resolved: 296d 7h 25m in source status, 1 execution, Hairong Kuang, 08/Nov/10 06:52
          • Resolved → Closed: 398d 23h 26m in source status, 1 execution, Konstantin Shvachko, 12/Dec/11 06:19

            People

            • Assignee:
              Hairong Kuang
              Reporter:
              Eli Collins
            • Votes:
              0
              Watchers:
              16
