Hadoop HDFS / HDFS-970

FSImage writing should always fsync before close

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.20.1, 0.21.0, 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Without an fsync, it's common that filesystems will delay the writing of metadata to the journal until all of the data blocks have been flushed. If the system crashes while the dirty pages haven't been flushed, the file is left in an indeterminate state. In some FSs (eg ext4) this will result in a 0-length file. In others (eg XFS) it will result in the correct length but any number of data blocks getting zeroed. Calling FileChannel.force before closing the FSImage prevents this issue.
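      For illustration, a minimal Java sketch of the fsync-before-close pattern the description refers to (the class and method names here are illustrative only, not the actual FSImage code):

      import java.io.DataOutputStream;
      import java.io.File;
      import java.io.FileOutputStream;
      import java.io.IOException;

      public class SyncBeforeCloseExample {
        /** Write a checkpoint file and force it to stable storage before closing. */
        static void writeCheckpoint(File out, byte[] contents) throws IOException {
          FileOutputStream fos = new FileOutputStream(out);
          try {
            DataOutputStream dos = new DataOutputStream(fos);
            dos.write(contents);
            dos.flush();                     // push buffered bytes to the OS
            fos.getChannel().force(true);    // fsync: data and metadata reach the disk
          } finally {
            fos.close();                     // close only after the force
          }
        }
      }

      The key point is that FileChannel.force runs before close, so a crash around the subsequent commit cannot leave a zero-length or partially zeroed file.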

      Attachments

      1. hdfs-970.txt (2 kB, Todd Lipcon)


          Activity

          Todd Lipcon added a comment -

          Relevant LKML thread:
          http://lkml.indiana.edu/hypermail/linux/kernel/0903.3/01327.html

          Note that the rename-for-atomic-commits trick originated in mail servers, which always did the proper fsync. When the word spread into the desktop world, it looks like this wisdom got lost.
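          For illustration, a sketch of that trick done with the fsync (plain Java, hypothetical names, not code from this patch): write the temp file, force it to disk, close it, then rename it over the final name.

          import java.io.File;
          import java.io.FileOutputStream;
          import java.io.IOException;

          public class AtomicCommit {
            // Rename-for-atomic-commits, done properly: without the force(),
            // the rename can be journaled before the data blocks are written,
            // so a crash may leave an empty or partially zeroed file.
            static void atomicWrite(File target, byte[] contents) throws IOException {
              File tmp = new File(target.getParentFile(), target.getName() + ".tmp");
              FileOutputStream fos = new FileOutputStream(tmp);
              try {
                fos.write(contents);
                fos.getChannel().force(true);   // fsync before close
              } finally {
                fos.close();
              }
              if (!tmp.renameTo(target)) {      // atomic replace on POSIX filesystems
                throw new IOException("rename of " + tmp + " to " + target + " failed");
              }
            }
          }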

          dhruba borthakur added a comment -

          This is a good thing to do. This could increase the namenode restart times, but would prevent corruption of fsimage. Ya!

          Todd Lipcon added a comment -

          Trivial patch to add fsyncs to the image saving, fstime writing, and TransferFsImage code. There are no tests included, since the only way to test this is to pull a power plug.

          There is probably some negative performance impact, depending on the size of the files, dirty page limits, available RAM, etc., but I think the safety factor is well worth it!
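          For reference, the same force-before-close pattern applied to a download path, in the spirit of the TransferFsImage change (an illustrative sketch only, not the patch itself):

          import java.io.FileOutputStream;
          import java.io.IOException;
          import java.io.InputStream;

          public class DownloadAndSync {
            // Copy a stream (e.g. an image fetched from another node) to a local
            // file, forcing it to disk before the caller closes the file.
            static void copyToDisk(InputStream in, FileOutputStream out) throws IOException {
              byte[] buf = new byte[64 * 1024];
              int n;
              while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
              }
              out.getChannel().force(true);   // fsync before close
            }
          }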

          dhruba borthakur added a comment -

          +1 code looks good. We should get this into 0.21 as well.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12444191/hdfs-970.txt
          against trunk revision 942863.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/353/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/353/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/353/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/353/console

          This message is automatically generated.

          Todd Lipcon added a comment -

          To prove that this is indeed absolutely necessary, I performed the following test on my desktop (2.6.31):

          1) Create a loop device with 1G storage

          # dd if=/dev/zero of=myloop bs=1M count=1000
          # losetup -f myloop

          2) Make a "faulty" type md array:

          # mdadm --create /dev/md0 --level=faulty --raid-devices=1  /dev/loop1

          3) format it as ext4

          # mkfs.ext4 /dev/md0

          4) mount it in /mnt

          # mount -t ext4 /dev/md0 /mnt

          5) run the following python script:

          #!/usr/bin/env python
          import os

          # Repeatedly write a large temp file, close it, then rename it into
          # place: the atomic-commit-by-rename pattern, with no fsync anywhere.
          for idx in xrange(1, 100000):
            f = file("file_%d_ckpt" % idx, "w")
            for line in xrange(0, 1000000):
              print >>f, "hello world! this is line %d " % line
            f.close()
            os.rename("file_%d_ckpt" % idx, "file_%d" % idx)
            print "Saved file %d" % idx
          

          6) While running, block all writes to the disk (this essentially freezes the disk as if a power outage occurred):

          # mdadm --grow /dev/md0 -l faulty -p write-all

          Script output:

          Saved file 1
          Saved file 2
          Saved file 3
          Saved file 4
          Saved file 5
          Traceback (most recent call last):
            File "/home/todd/disk-fault/test.py", line 7, in <module>
              print >>f, "hello world! this is line %d " % line
          IOError: [Errno 30] Read-only file system
          

          [ext4 automatically remounts itself readonly]
          7) umount /mnt, clear the fault with -p clear, remount /mnt
          8) results of ls -l:

          root@todd-desktop:/mnt# ls -l
          total 16
          -rw-r--r-- 1 root root     0 2010-05-13 23:11 file_1
          -rw-r--r-- 1 root root     0 2010-05-13 23:11 file_2
          -rw-r--r-- 1 root root     0 2010-05-13 23:11 file_3_ckpt
          drwx------ 2 root root 16384 2010-05-13 22:45 lost+found
          

          I then modified the test script to add f.flush() and os.fsync(f.fileno()) right before the close(), and ran the exact same test. Results:

          root@todd-desktop:/mnt# ~/disk-fault/test.py 
          Saved file 1
          Saved file 2
          Traceback (most recent call last):
            File "/home/todd/disk-fault/test.py", line 9, in <module>
              os.fsync(f.fileno())
          
          [umount, clear fault, remount]
          
          root@todd-desktop:/mnt# ls -l
          total 66208
          -rw-r--r-- 1 root root 33888890 2010-05-13 23:20 file_1
          -rw-r--r-- 1 root root 33888890 2010-05-13 23:20 file_2_ckpt
          drwx------ 2 root root    16384 2010-05-13 22:45 lost+found
          

          I tried the same test on ext3, and without the fsync the files entirely disappeared. The same was true of XFS. Adding the fsync before close fixed the issue in all cases.

          dhruba borthakur added a comment -

          I just committed this, thanks Todd.


            People

             • Assignee: Todd Lipcon
             • Reporter: Todd Lipcon
             • Votes: 0
             • Watchers: 7
