[HDFS-5042] Completed files lost after power failure

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      ext3 on CentOS 5.7 (kernel 2.6.18-274.el5)

      Description

      We suffered a cluster wide power failure after which HDFS lost data that it had acknowledged as closed and complete.

      The client was HBase which compacted a set of HFiles into a new HFile, then after closing the file successfully, deleted the previous versions of the file. The cluster then lost power, and when brought back up the newly created file was marked CORRUPT.

      Based on reading the logs it looks like the replicas were created by the DataNodes in the 'blocksBeingWritten' directory. Then when the file was closed they were moved to the 'current' directory. After the power cycle those replicas were again in the blocksBeingWritten directory of the underlying file system (ext3). When those DataNodes reported in to the NameNode it deleted those replicas and lost the file.

      Some possible fixes could be having the DataNode fsync the directory(s) after moving the block from blocksBeingWritten to current to ensure the rename is durable or having the NameNode accept replicas from blocksBeingWritten under certain circumstances.
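
      A rough sketch of what the first option could look like (class and helper names hypothetical; assumes a Linux JVM where a directory can be opened read-only and force()'d, which the Java APIs do not guarantee):

      import java.io.IOException;
      import java.nio.channels.FileChannel;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.StandardCopyOption;
      import java.nio.file.StandardOpenOption;

      // Hypothetical sketch: finalize a replica by renaming it from
      // blocksBeingWritten to current, then fsync the affected directories so
      // the rename itself survives a power failure.
      class DurableFinalize {

        // Flush a directory's entries to disk. On Linux this can be done by
        // opening the directory read-only and calling force(); it is not
        // guaranteed to work on every platform/JVM.
        static void fsyncDirectory(Path dir) throws IOException {
          try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
            ch.force(true);
          }
        }

        static void finalizeReplica(Path bbwBlock, Path currentDir) throws IOException {
          Path target = currentDir.resolve(bbwBlock.getFileName());
          Files.move(bbwBlock, target, StandardCopyOption.ATOMIC_MOVE); // atomic rename
          fsyncDirectory(currentDir);           // make the new directory entry durable
          fsyncDirectory(bbwBlock.getParent()); // make the removal from blocksBeingWritten durable
        }
      }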

      Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode):

      RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: Creating file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c with permission=rwxrwxrwx
      NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. blk_1395839728632046111_357084589
      DN 2013-06-29 11:16:06,832 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /10.0.5.237:50010
      NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to blk_1395839728632046111_357084589 size 25418340
      NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to blk_1395839728632046111_357084589 size 25418340
      NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to blk_1395839728632046111_357084589 size 25418340
      DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327
      DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_1395839728632046111_357084589 terminating
      NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on  file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c from client DFSClient_hb_rs_hs745,60020,1372470111932
      NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c is closed by DFSClient_hb_rs_hs745,60020,1372470111932
      RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c to hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
      RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 7 file(s) in n of users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m
      
      -------  CRASH, RESTART ---------
      
      NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was rejected: Reported as block being written but is a block of closed file.
      NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.1:50010
      NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264 but was rejected: Reported as block being written but is a block of closed file.
      NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.5.237:50010
      NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but was rejected: Reported as block being written but is a block of closed file.
      NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.24:50010
      (Note: the clock on the server running the DN is wrong after restart; I believe its timestamps are off by 6 hours.)
      DN 2013-06-29 06:07:22,877 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block blk_1395839728632046111_357084589 file /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for deletion
      DN 2013-06-29 06:07:24,952 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block blk_1395839728632046111_357084589 at file /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
      

      There was some additional discussion on this thread on the mailing list:

      http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201307.mbox/%3CCA+qbEUPuf19PL_EVeWi1104+scLVrcS0LTFUvBPw=qcuXnZ8hQ@mail.gmail.com%3E


          Activity

          Todd Lipcon added a comment -

          I think this is expected behavior. To avoid this, you probably need to set the "sync on close" option. I'd also recommend the "sync.behind.writes" option if you enable "sync on close" to avoid bursty IO. With those two options together, you'll have a bit of a performance impact, but not awful, and you'll avoid data loss on complete power failure.

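          For reference, those two options presumably map to the following DataNode properties in hdfs-site.xml (both default to false); an illustrative snippet, not a verified excerpt:

          <property>
            <name>dfs.datanode.synconclose</name>
            <value>true</value>
          </property>
          <property>
            <name>dfs.datanode.sync.behind.writes</name>
            <value>true</value>
          </property>
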
          Dave Latham added a comment -

          Thanks, Todd, for the suggestions. Do you know if "sync on close" syncs not only the file but also the directory entry after the rename? My impression is that the file gets closed and synced, but then the rename happens from blocksBeingWritten to current, and if the directory is not also fsynced after the rename then the above scenario will still happen.

          Todd Lipcon added a comment -

          That's a good point. I don't think it does sync the directory, since that's AFAIK not possible in Java.

          Dave Latham added a comment -

          So perhaps the best course then is to have the NameNode accept block reports from blocksBeingWritten under certain circumstances in order to guarantee durability.

          Luke Lu added a comment -

          We've also reproduced the data loss (though with less frequency) even with dfs.datanode.synconclose set to true.

          We should do something along the lines of CASSANDRA-3250 in the DataNode when "sync on close" is turned on.

          Lars Hofhansl added a comment -

          Is this still a problem when write barriers are enabled on the DNs? ext3 has them off by default.
          In that case we might need to move the file into place first and then fsync the file, which should force the metadata updates in order... though I'm sure that'd cause other problems.

          Luke Lu added a comment -

          We actually reproduced the problem (pretty easily: a couple of times out of 10) on ext4 with barriers enabled (you need nobarrier to turn them off).

          BTW, moving the file first defeats the purpose of the atomic rename. We should fsync both (file and directory) and check the result of both before returning the status of close.

          Lars Hofhansl added a comment -

          Thanks Luke.

          I meant to say: (1) finish writing the block. (2) Move it. (3) fsync or fdatasync the block file in the new location.
          (We'd just change the order of moving vs. fsync.)

          The rename would still be atomic (the block file is written completely before we move it), but doing the fsync afterwards should order the metadata commits correctly, assuming write barriers. Then again, the write and the move would be two different transactions as far as the fs is concerned.

          Agree it's cleanest if we in fact sync both actions.

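          A minimal sketch of that reordering (class and method names hypothetical, using the same Java 7 NIO calls as the sketch in the description; here the block file, not the directory, is flushed after the rename):

          import java.io.IOException;
          import java.nio.channels.FileChannel;
          import java.nio.file.*;

          class FinalizeThenSyncFile {
            static void finalizeReplica(Path bbwBlock, Path currentDir) throws IOException {
              // (1) the block data is assumed to be fully written already
              Path target = currentDir.resolve(bbwBlock.getFileName());
              Files.move(bbwBlock, target, StandardCopyOption.ATOMIC_MOVE); // (2) atomic rename
              try (FileChannel ch = FileChannel.open(target, StandardOpenOption.WRITE)) {
                ch.force(false); // (3) fdatasync-style flush of the block file at its new path
              }
            }
          }
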
          Luke Lu added a comment -

          The write barrier support, as I understand it, is strictly used to flush device/disk cache, which is not actually relevant here.

          You're trying to rely on fs impl details to minimize fsyncs. OTOH, I think I might have found a real workaround for ext3/ext4: use the dirsync mount option in conjunction with "sync on close".

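          For example, the data volume could carry the option in /etc/fstab or have it applied with a remount (device name and mount point are placeholders loosely based on the DN log paths):

          # /etc/fstab entry for the DataNode data volume
          /dev/sdb1  /data/hadoop  ext3  defaults,noatime,dirsync  0  2

          # or apply it to an already-mounted volume
          mount -o remount,dirsync /data/hadoop
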
          Lars Hofhansl added a comment -

          Cool. That should work.

          Lars Hofhansl added a comment -

          We should study the perf impact.

          Previously I found that sync-on-close severely impacted file creation time - unless sync-behind-writes is also enabled. (Interestingly, sync-behind-writes should not cause any performance detriment: since we're dealing with immutable files, delaying the writeback of these dirty pages in the hope that they'd be rewritten before we flush them is pointless anyway.)

          Vikas Vishwakarma added a comment -

          Updating the results for a 1000-file DFSIO write test run with dirsync enabled and disabled. The results of the two runs are similar, apart from a very small difference of about -2% in average IO rate with dirsync enabled.

          DFSIO Test:
          ./hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.3.0-cdh5.0.1-sfdc-2.0.0-tests.jar TestDFSIO -write -nrFiles 1000 -fileSize 1000

          Env1 (Datanodes:15, Containers:29, mount options: rw,dirsync,noatime)
          ===========================================================
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): ----- TestDFSIO ----- : write
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): Date & time: Tue Oct 14 04:11:47 GMT+00:00 2014
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): Number of files: 1000
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): Total MBytes processed: 1000000.0
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): Throughput mb/sec: 21.114069074888118
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): Average IO rate mb/sec: 21.986719131469727
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): IO rate std deviation: 4.507905485162703
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): Test exec time sec: 1937.989
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855):

          Env2 (Datanodes:15, Containers:29, mount options: rw,noatime)
          ====================================================
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): ----- TestDFSIO ----- : write
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): Date & time: Tue Oct 14 04:32:25 GMT 2014
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): Number of files: 1000
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): Total MBytes processed: 1000000.0
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): Throughput mb/sec: 21.391594681989666
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): Average IO rate mb/sec: 22.406478881835938
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): IO rate std deviation: 5.169537520933585
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): Test exec time sec: 1872.904
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855):

          Kihwal Lee added a comment -

          We can make it optional and put it in after HDFS-8791.

          Surendra Singh Lilhore added a comment -

          We also faced the same problem. Can we recover this kind of block from the NameNode after getting the block report?
          If the reported block's genstamp and size match the NameNode's in-memory metadata, then the NameNode could send a command to the DataNode to recover the replica from the wrong state.

          Kihwal Lee added a comment -

          Can we recover this kind of block from namenode after getting block report?

          In most cases, these block files will be 0 bytes after reboot. The file system has already lost the data, so there is nothing the NN or DN can do. The solution is to sync the directory entry.


            People

            • Assignee: Unassigned
            • Reporter: Dave Latham
            • Votes: 0
            • Watchers: 27
