Hadoop HDFS / HDFS-957

FSImage layout version should be written only once the file is complete

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.22.0
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None

      Description

      Right now, the FSImage save code writes the LAYOUT_VERSION at the head of the file, along with some other headers, and then dumps the directory into the file. Instead, it should write a special IMAGE_IN_PROGRESS entry for the layout version, dump all of the data, then seek back to the head of the file to write the proper LAYOUT_VERSION. This would make it very easy to detect the case where the FSImage save got interrupted.
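      As a rough illustration of the proposed save sequence (the sentinel constant and helper method below are hypothetical sketches, not the actual FSImage code):

{code:java}
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ImageSaveSketch {
  // Hypothetical sentinel: any value that can never be a valid layout version.
  public static final int IMAGE_IN_PROGRESS = Integer.MIN_VALUE;

  public static void saveImage(File imageFile, int layoutVersion) throws IOException {
    // Pass 1: write the sentinel where LAYOUT_VERSION normally goes,
    // then dump the remaining headers and the directory tree.
    try (DataOutputStream out =
             new DataOutputStream(new FileOutputStream(imageFile))) {
      out.writeInt(IMAGE_IN_PROGRESS);
      writeHeadersAndDirectory(out);
    }
    // Pass 2: seek back to offset 0 and overwrite the sentinel with the
    // real layout version, marking the image as complete.
    try (RandomAccessFile raf = new RandomAccessFile(imageFile, "rw")) {
      raf.seek(0);
      raf.writeInt(layoutVersion);
    }
  }

  private static void writeHeadersAndDirectory(DataOutputStream out)
      throws IOException {
    // namespace ID, file count, directory tree, etc. would be dumped here
  }
}
{code}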

      Attachments

      1. hdfs-957.txt (4 kB), attached by Todd Lipcon

          Activity

          Todd Lipcon added a comment -

          This probably could do with some new unit tests, but wanted to attach here for early comment to see if people are on board with the idea.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12435248/hdfs-957.txt
          against trunk revision 907550.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/221/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/221/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/221/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/221/console

          This message is automatically generated.

          Todd Lipcon added a comment -

          The failed test is something security-related.

          Any thoughts on this?

          dhruba borthakur added a comment -

          Hi Todd, Can the NN write a trailer record to the image to indicate that the entire file is successfully written? The trailer can also have the CRC, etc of the file contents.

          Todd Lipcon added a comment -

          Hey Dhruba,

          Trailer works too, but that's a file format change. By putting a special value at the head, then seeking back to replace it with the real layout, we have the same effect but don't change file format at all.

          As for CRC, I think that should be done in HDFS-903. I wanted to keep this JIRA a very small change and leave the big work for HDFS-903.

          dhruba borthakur added a comment -

          Ok, I see your point.

          Suppose the NN issues the write system call to write IMAGE_IN_PROGRESS to the header initially. Then it completes writing all the data to the file and then seeks back to the layout version and then writes the correct LAYOUT VERSION. Then the NN tries to close the file. Meanwhile, all these writes were buffered in the OS buffers. Now, the NN closes the file and the header sector that has the correct LAYOUT VERSION is flushed to the disk whereas some other pages of the file encountered an error while being flushed. This kind of error could be detected by HDFS-903, couldn't it?

          The other question is that the device that will store the FSImage now needs to be Seekable. Was this the case earlier too?

          +1 for this patch.

          Todd Lipcon added a comment -

          > Now, the NN closes the file and the header sector that has the correct LAYOUT VERSION is flushed to the disk whereas some other pages of the file encountered an error while being flushed

          In the attached patch I actually close the file, then re-open with RandomAccessFile to set the header correct. My thinking was that close() would flush – it's probably not guaranteed, but with most journaled filesystems (e.g. ext3 with data=ordered) the data must be written before the metadata transaction will commit to disk. Therefore on an OS crash or power outage the file will either not have committed its metadata, in which case it will appear empty or not exist at all, or it will have committed its metadata after flushing the complete data. Whether the rewritten correct LAYOUT_VERSION has been synced is unknown, but I think the ordering is going to be correct.

          That said, we should probably stick a FileChannel.force() in there after both writes to be extra safe. It may hurt performance a little bit, but I think this is one of those areas where we should err on the side of safety, yea?
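          For illustration, syncing both writes might look like this (a sketch under the assumptions above, not the attached patch itself):

{code:java}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SyncedHeaderRewrite {
  public static void writeImageSafely(File imageFile, byte[] imageBytes,
                                      int layoutVersion) throws IOException {
    try (FileOutputStream fos = new FileOutputStream(imageFile)) {
      fos.write(imageBytes);        // sentinel header + image data
      fos.getChannel().force(true); // force data to disk before close
    }
    try (RandomAccessFile raf = new RandomAccessFile(imageFile, "rw")) {
      raf.seek(0);
      raf.writeInt(layoutVersion);  // replace sentinel with the real version
      raf.getChannel().force(true); // force the header rewrite too
    }
  }
}
{code}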

          > The other question is that the device that will store the FSImage now needs to be Seekable. Was this the case earlier too?

          The current code simply uses a FileOutputStream - there's no abstraction going on. So it is assuming the existence of a normal filesystem with random access. For the edit logs, we can't assume seek currently, but I think it's a fair assumption for the images.

          dhruba borthakur added a comment -

          I like your idea of using FileChannel.force().

          Konstantin Shvachko added a comment -

          1. I don't understand what problem this solves. We always save the new image into IMAGE_NEW. Then we rename it to IMAGE. If IMAGE_NEW exists, we treat it as if the save did not complete (unsuccessful checkpoint or save). IMAGE_NEW is then simply removed and the old image in the IMAGE file is used as the current image; there is no need to track progress of writing into IMAGE_NEW (see the sketch after this comment).
          2. It is very important during upgrade that we do not write anything into the VERSION file until the IMAGE file is written. Writing the VERSION file is like a commit when the image directory is formed: we first write everything else, then the VERSION file, which indicates the directory is in good order. If you create the VERSION file before completing the image, all this logic will not work.
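          For context, a rough sketch of the rename-based commit described in point 1 (file names and helper are illustrative):

{code:java}
import java.io.File;
import java.io.IOException;

public class CheckpointSketch {
  // Save into IMAGE_NEW, then rename over IMAGE; the rename is the commit.
  public static void checkpoint(File dir) throws IOException {
    File imageNew = new File(dir, "fsimage.ckpt"); // IMAGE_NEW
    File image = new File(dir, "fsimage");         // IMAGE
    saveNamespaceTo(imageNew);                     // may be interrupted
    if (!imageNew.renameTo(image)) {
      throw new IOException("could not commit checkpoint");
    }
  }

  // On startup, a leftover IMAGE_NEW means the save never committed,
  // so it is removed and the old IMAGE is used instead.
  public static void recoverOnStartup(File dir) {
    File imageNew = new File(dir, "fsimage.ckpt");
    if (imageNew.exists()) {
      imageNew.delete();
    }
  }

  private static void saveNamespaceTo(File f) throws IOException {
    // dump of the namespace would go here
  }
}
{code}
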
          Todd Lipcon added a comment -

          > I don't understand what problem this solves...

          Looking at HDFS-955, it's clear that the current recovery mechanisms are not entirely complete. I'm for adding this as another safety guard / sanity check. It costs essentially nothing and makes absolutely sure we never try to read an unfinished image.

          In particular, with this patch we'd be able to fix the issue raised in HDFS-955 fairly trivially. Rather than always assuming that an fsimage.ckpt file is incomplete, we could easily recover with no ambiguity.
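          With the sentinel in place, that check could be as simple as reading back the first four bytes (a sketch; the sentinel constant is hypothetical):

{code:java}
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class ImageCompletenessCheck {
  static final int IMAGE_IN_PROGRESS = Integer.MIN_VALUE; // hypothetical sentinel

  // An fsimage.ckpt whose header still reads IMAGE_IN_PROGRESS was
  // interrupted mid-save; any real layout version means the save finished.
  public static boolean imageIsComplete(File imageFile) throws IOException {
    try (DataInputStream in =
             new DataInputStream(new FileInputStream(imageFile))) {
      return in.readInt() != IMAGE_IN_PROGRESS;
    }
  }
}
{code}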

          > It is very important during upgrade that we do not write anything into the VERSION file until the IMAGE file is written

          This doesn't propose to change that. We're simply changing the order in which we write the IMAGE file - it doesn't change any ordering with regard to the other metadata files.

          Konstantin Shvachko added a comment -

          Sorry, you are not writing it to the VERSION file. I am taking back my second point. I thought you did, because LAYOUT_VERSION is just a redundant field in fsimage.

          Konstantin Shvachko added a comment -

          > It costs essentially nothing and makes absolutely sure we never try to read an unfinished image.

          But we already never do: we remove IMAGE_NEW whenever it exists, without even trying to open it. I don't think we need more complexity here.

          Todd Lipcon added a comment -

          See my motivation for the ability to tell if an image is complete here: https://issues.apache.org/jira/browse/HDFS-955?focusedCommentId=12832212&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12832212

          Konstantin Shvachko added a comment -

          Should we close this with HDFS-955 in?

          Todd Lipcon added a comment -

          I think this is still a good idea, even with that bug fixed. The extra safeguard doesn't really cost us anything, and the fsync() is important when using a FS with delayed allocation.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12435248/hdfs-957.txt
          against trunk revision 1051669.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/30//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12435248/hdfs-957.txt
          against trunk revision 1072023.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/198//console

          This message is automatically generated.


            People

            • Assignee: Todd Lipcon
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 10
