Hadoop HDFS / HDFS-5042

Completed files lost after power failure

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2
    • Component/s: None
    • Labels:
      None
    • Environment:

      ext3 on CentOS 5.7 (kernel 2.6.18-274.el5)

    • Hadoop Flags:
      Reviewed

      Description

      We suffered a cluster wide power failure after which HDFS lost data that it had acknowledged as closed and complete.

      The client was HBase which compacted a set of HFiles into a new HFile, then after closing the file successfully, deleted the previous versions of the file. The cluster then lost power, and when brought back up the newly created file was marked CORRUPT.

      Based on reading the logs it looks like the replicas were created by the DataNodes in the 'blocksBeingWritten' directory. Then when the file was closed they were moved to the 'current' directory. After the power cycle those replicas were again in the blocksBeingWritten directory of the underlying file system (ext3). When those DataNodes reported in to the NameNode it deleted those replicas and lost the file.

      Some possible fixes: have the DataNode fsync the directory (or directories) after moving the block from blocksBeingWritten to current, to ensure the rename is durable; or have the NameNode accept replicas from blocksBeingWritten under certain circumstances.
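      The first of those fixes can be sketched as follows. This is a minimal, standalone illustration of the durable-rename sequence, not DataNode code; the paths and file contents are stand-ins. Opening a directory as a FileChannel and calling force() works on Linux but may be rejected on other platforms.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class DurableRename {
    // fsync a directory so that a just-completed rename survives a power failure.
    static void fsyncDir(Path dir) throws IOException {
        try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
            ch.force(true); // flush directory metadata (the rename) to stable storage
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for the DataNode's blocksBeingWritten and current directories
        Path root = Files.createTempDirectory("dn");
        Path bbw  = Files.createDirectory(root.resolve("blocksBeingWritten"));
        Path cur  = Files.createDirectory(root.resolve("current"));
        Path blk  = Files.write(bbw.resolve("blk_123"), new byte[]{1, 2, 3});

        // 1) the block file itself is assumed already synced (dfs.datanode.synconclose)
        // 2) atomically move the finalized replica into place
        Files.move(blk, cur.resolve("blk_123"), StandardCopyOption.ATOMIC_MOVE);
        // 3) fsync both parent directories; otherwise the rename may exist only
        //    in the page cache and be rolled back by a crash, reproducing this bug
        fsyncDir(cur);
        fsyncDir(bbw);
        System.out.println(Files.exists(cur.resolve("blk_123")));
    }
}
```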

      Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode):

      RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: Creating file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c with permission=rwxrwxrwx
      NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. blk_1395839728632046111_357084589
      DN 2013-06-29 11:16:06,832 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /10.0.5.237:50010
      NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to blk_1395839728632046111_357084589 size 25418340
      NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to blk_1395839728632046111_357084589 size 25418340
      NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to blk_1395839728632046111_357084589 size 25418340
      DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327
      DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_1395839728632046111_357084589 terminating
      NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on  file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c from client DFSClient_hb_rs_hs745,60020,1372470111932
      NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c is closed by DFSClient_hb_rs_hs745,60020,1372470111932
      RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c to hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
      RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 7 file(s) in n of users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m
      
      -------  CRASH, RESTART ---------
      
      NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was rejected: Reported as block being written but is a block of closed file.
      NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.1:50010
      NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264 but was rejected: Reported as block being written but is a block of closed file.
      NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.5.237:50010
      NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but was rejected: Reported as block being written but is a block of closed file.
      NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.24:50010
      (Note: the clock on the server running the DN is wrong after the restart; I believe the timestamps below are off by 6 hours.)
      DN 2013-06-29 06:07:22,877 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block blk_1395839728632046111_357084589 file /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for deletion
      DN 2013-06-29 06:07:24,952 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block blk_1395839728632046111_357084589 at file /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
      

      There was some additional discussion on this thread on the mailing list:

      http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201307.mbox/%3CCA+qbEUPuf19PL_EVeWi1104+scLVrcS0LTFUvBPw=qcuXnZ8hQ@mail.gmail.com%3E

      1. HDFS-5042-branch-2-05.patch
        18 kB
        Vinayakumar B
      2. HDFS-5042-branch-2-01.patch
        18 kB
        Vinayakumar B
      3. HDFS-5042-branch-2.8-addendum.patch
        5 kB
        Vinayakumar B
      4. HDFS-5042-branch-2.8-06.patch
        16 kB
        Vinayakumar B
      5. HDFS-5042-branch-2.8-05.patch
        16 kB
        Vinayakumar B
      6. HDFS-5042-branch-2.7-06.patch
        15 kB
        Vinayakumar B
      7. HDFS-5042-branch-2.7-05.patch
        15 kB
        Vinayakumar B
      8. HDFS-5042-05-branch-2.patch
        18 kB
        Vinayakumar B
      9. HDFS-5042-05.patch
        18 kB
        Vinayakumar B
      10. HDFS-5042-04.patch
        18 kB
        Vinayakumar B
      11. HDFS-5042-03.patch
        18 kB
        Vinayakumar B
      12. HDFS-5042-02.patch
        19 kB
        Vinayakumar B
      13. HDFS-5042-01.patch
        18 kB
        Vinayakumar B

        Issue Links

          Activity

          tlipcon Todd Lipcon added a comment -

          I think this is expected behavior. To avoid this, you probably need to set the "sync on close" option. I'd also recommend the "sync.behind.writes" option if you enable "sync on close", to avoid bursty IO. With those two options together you'll take a bit of a performance hit, but not an awful one, and you'll avoid data loss on a complete power failure.
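          For reference, the two knobs Todd mentions are DataNode settings in hdfs-site.xml; a sketch, assuming the property names as they appear in hdfs-default.xml (both default to false — verify against your Hadoop version):

```xml
<property>
  <name>dfs.datanode.synconclose</name>
  <value>true</value>
  <!-- fsync block and meta files when a block is finalized ("sync on close") -->
</property>
<property>
  <name>dfs.datanode.sync.behind.writes</name>
  <value>true</value>
  <!-- hint the OS to write data to disk as it arrives, smoothing out the IO burst -->
</property>
```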

          davelatham Dave Latham added a comment -

          Thanks, Todd, for the suggestions. Do you know if "sync on close" not only syncs the file but also the directory entry after the rename? My impression is that the file gets closed and synced, but then the rename happens from blocksBeingWritten to current, and if the directory is not also fsynced after the rename then the above scenario will still happen.

          tlipcon Todd Lipcon added a comment -

          That's a good point. I don't think it does sync the directory, since that's AFAIK not possible in Java.

          davelatham Dave Latham added a comment -

          So perhaps the best course then is to have the NameNode accept block reports from blocksBeingWritten under certain circumstances in order to guarantee durability.

          vicaya Luke Lu added a comment -

          We've also reproduced the data loss (though with less frequency) even with dfs.datanode.synconclose set to true.

          We should do something along the lines of CASSANDRA-3250 in the data node when "sync on close" is turned on.

          lhofhansl Lars Hofhansl added a comment -

          Is this a problem when write barriers are enabled on the DNs? ext3 has them off by default.
          In that case we might need to move the file into place first and then fsync the file; that should force the metadata updates in order... I'm sure that'd cause other problems.

          vicaya Luke Lu added a comment -

          We actually reproduced the problem (pretty easily: a couple of times out of 10) on ext4 with barriers enabled (you need nobarrier to turn them off).

          BTW, moving the file first defeats the purpose of the atomic rename. We should fsync both (file and directory) and check the result of both before returning the status of close.

          lhofhansl Lars Hofhansl added a comment -

          Thanks Luke.

          I meant to say: (1) finish writing the block. (2) Move it. (3) fsync or fdatasync the block file in the new location.
          (We'd just change the order of moving vs. fsync.)

          The rename would still be atomic (file block is written completely before we move it), but doing the fsync after should order the meta data commits correctly assuming write barriers. Then again the write and the move would be two different transactions as far as the fs is concerned.

          Agree it's cleanest if we in fact sync both actions.

          vicaya Luke Lu added a comment -

          The write barrier support, as I understand it, is strictly used to flush the device/disk cache, which is not actually relevant here.

          You're trying to rely on fs implementation details to minimize fsyncs. OTOH, I think I might have found a real workaround for ext3/ext4: use the dirsync mount option in conjunction with "sync on close".
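          Concretely, the workaround is a mount-level change on the DataNode data volumes; an illustrative fragment (device name and mount point are placeholders for your own):

```shell
# /etc/fstab - dirsync makes directory updates (renames) synchronous on this volume
/dev/sdb1  /data/hadoop  ext4  rw,noatime,dirsync  0 2

# or apply to a live mount without a reboot:
mount -o remount,dirsync /data/hadoop
```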

          lhofhansl Lars Hofhansl added a comment -

          Cool. That should work.

          lhofhansl Lars Hofhansl added a comment -

          We should study the perf impact.

          Previously I found that sync-on-close severely impacted file creation time - unless sync-behind-writes is also enabled. (Interestingly, sync-behind-writes should not cause any performance detriment: we're dealing with immutable files, so delaying the write-out of these dirty blocks in the hope that they'd be updated before we write them is pointless anyway.)

          vik.karma Vikas Vishwakarma added a comment -

          Updating the results for the 1000-file DFSIO write test run with dirsync enabled and disabled. Results for both runs are similar, except for a very small difference in average IO rate of -2% with dirsync enabled.

          DFSIO Test:
          ./hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.3.0-cdh5.0.1-sfdc-2.0.0-tests.jar TestDFSIO -write -nrFiles 1000 -fileSize 1000

          Env1 (Datanodes:15, Containers:29, mount options: rw,dirsync,noatime)
          ===========================================================
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): ----- TestDFSIO ----- : write
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): Date & time: Tue Oct 14 04:11:47 GMT+00:00 2014
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): Number of files: 1000
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): Total MBytes processed: 1000000.0
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): Throughput mb/sec: 21.114069074888118
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): Average IO rate mb/sec: 21.986719131469727
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): IO rate std deviation: 4.507905485162703
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855): Test exec time sec: 1937.989
          14/10/14 04:11:47 [main] INFO fs.TestDFSIO(855):

          Env2 (Datanodes:15, Containers:29, mount options: rw,noatime)
          ====================================================
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): ----- TestDFSIO ----- : write
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): Date & time: Tue Oct 14 04:32:25 GMT 2014
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): Number of files: 1000
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): Total MBytes processed: 1000000.0
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): Throughput mb/sec: 21.391594681989666
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): Average IO rate mb/sec: 22.406478881835938
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): IO rate std deviation: 5.169537520933585
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855): Test exec time sec: 1872.904
          14/10/14 04:32:25 [main] INFO fs.TestDFSIO(855):

          kihwal Kihwal Lee added a comment -

          We can make it optional and put in after HDFS-8791.

          surendrasingh Surendra Singh Lilhore added a comment -

          We also faced the same problem. Can we recover this kind of block on the namenode after getting the block report?
          If the reported block's genstamp and size match the NameNode's in-memory metadata, then the NameNode could send a command to the datanode to recover from the wrong replica state.

          kihwal Kihwal Lee added a comment -

          Can we recover this kind of block from namenode after getting block report?

          In most cases, these block files will be 0 bytes after reboot. The file system has already lost the data, so there is nothing the NN or DN can do. The solution is to sync the directory entry.

          surendrasingh Surendra Singh Lilhore added a comment -

          Thanks Kihwal Lee. In my case the blocks are in complete state in the namenode, but the datanodes reported them in RBW state after reboot.

          kihwal Kihwal Lee added a comment -

          We've seen that too. Unless there is another bug in Hadoop, I think the directory sync will fix that too.

          andrew.wang Andrew Wang added a comment -

          It seems like Lucene has figured out how to fsync a directory in Java 7, worth revisiting this for HDFS given that we've dropped Java 6 support?

          vinayrpet Vinayakumar B added a comment - - edited

          Thanks Andrew Wang for the pointer.
          Lucene did find a workaround in LUCENE-5588, which was recently broken by an openjdk commit tracked in LUCENE-6169.

          This link gives the initial idea of how it might work. But since there's no official documentation of the behavior, the openjdk commit broke it.

          Also, this workaround is Linux-only, so we need to find a way that works on all platforms, or else adopt this approach for Linux at least.

          From this link:

          (Windows already ensures that the metadata is written correctly after an atomic rename.) On the other hand, MacOSX looks like it ignores fsync requests completely - also on files - if you don't use a special fcntl

          Looks like this directory sync is not required for Windows after all. Maybe some Windows expert can confirm this?
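          The Lucene-style trick boils down to opening the directory itself as a FileChannel and forcing it. A minimal sketch of the best-effort variant being discussed (an illustration, not the attached patch; the class and method names are hypothetical):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public final class DirFsync {
    /**
     * Best-effort directory fsync, Lucene-style: open the directory for read
     * and force() it. Linux allows opening a directory as a FileChannel;
     * platforms that refuse simply skip the sync, which per the discussion
     * above may be acceptable (e.g. Windows makes renames durable on its own).
     */
    public static void syncDirectory(Path dir) {
        try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
            ch.force(true); // push the directory's metadata to stable storage
        } catch (IOException e) {
            // best effort: directory fsync unsupported on this platform
        }
    }
}
```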

          kihwal Kihwal Lee added a comment -

          It seems like Lucene has figured out how to fsync a directory in Java 7, worth revisiting this for HDFS given that we've dropped Java 6 support?

          +1 for adding the support for branches that require Java 7 and higher. I guess the bug in Java 9 is not our immediate concern.

          vinayrpet Vinayakumar B added a comment -

          Attached the proposed changes for trunk.

          vinayrpet Vinayakumar B added a comment -

          Attached the branch-2 patch.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 26s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 0m 46s Maven dependency ordering for branch
          +1 mvninstall 6m 53s branch-2 passed
          +1 compile 7m 18s branch-2 passed with JDK v1.8.0_131
          +1 compile 7m 0s branch-2 passed with JDK v1.7.0_131
          +1 checkstyle 1m 33s branch-2 passed
          +1 mvnsite 1m 57s branch-2 passed
          +1 mvneclipse 0m 30s branch-2 passed
          +1 findbugs 3m 49s branch-2 passed
          +1 javadoc 1m 30s branch-2 passed with JDK v1.8.0_131
          +1 javadoc 1m 52s branch-2 passed with JDK v1.7.0_131
          0 mvndep 0m 15s Maven dependency ordering for patch
          +1 mvninstall 1m 29s the patch passed
          +1 compile 7m 12s the patch passed with JDK v1.8.0_131
          +1 javac 7m 12s the patch passed
          +1 compile 6m 43s the patch passed with JDK v1.7.0_131
          +1 javac 6m 43s the patch passed
          -0 checkstyle 1m 30s root: The patch generated 2 new + 320 unchanged - 1 fixed = 322 total (was 321)
          +1 mvnsite 1m 50s the patch passed
          +1 mvneclipse 0m 31s the patch passed
          +1 whitespace 0m 1s The patch has no whitespace issues.
          +1 findbugs 4m 14s the patch passed
          -1 javadoc 0m 43s hadoop-common-project_hadoop-common-jdk1.8.0_131 with JDK v1.8.0_131 generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)
          -1 javadoc 0m 50s hadoop-common-project_hadoop-common-jdk1.7.0_131 with JDK v1.7.0_131 generated 1 new + 10 unchanged - 0 fixed = 11 total (was 10)
          +1 unit 8m 0s hadoop-common in the patch passed with JDK v1.7.0_131.
          -1 unit 54m 3s hadoop-hdfs in the patch failed with JDK v1.7.0_131.
          +1 asflicense 0m 23s The patch does not generate ASF License warnings.
          186m 1s



          Reason Tests
          JDK v1.8.0_131 Failed junit tests hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
            hadoop.hdfs.server.namenode.TestDecommissioningStatus
          JDK v1.7.0_131 Failed junit tests hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
            hadoop.tracing.TestTraceAdmin
            hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA
            hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithUpgradeDomain
          JDK v1.7.0_131 Timed out junit tests org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:8515d35
          JIRA Issue HDFS-5042
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12869210/HDFS-5042-branch-2-01.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 11b7df05d2bb 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2 / fe185e2
          Default Java 1.7.0_131
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_131 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_131
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/19538/artifact/patchprocess/diff-checkstyle-root.txt
          javadoc https://builds.apache.org/job/PreCommit-HDFS-Build/19538/artifact/patchprocess/diff-javadoc-javadoc-hadoop-common-project_hadoop-common-jdk1.8.0_131.txt
          javadoc https://builds.apache.org/job/PreCommit-HDFS-Build/19538/artifact/patchprocess/diff-javadoc-javadoc-hadoop-common-project_hadoop-common-jdk1.7.0_131.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/19538/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_131.txt
          JDK v1.7.0_131 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/19538/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19538/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          vinayrpet Vinayakumar B added a comment -

          Updated the patch to fix javadoc and checkstyle

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 18s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 1m 41s Maven dependency ordering for branch
          +1 mvninstall 15m 40s trunk passed
          +1 compile 15m 32s trunk passed
          +1 checkstyle 2m 1s trunk passed
          +1 mvnsite 2m 7s trunk passed
          +1 mvneclipse 0m 41s trunk passed
          -1 findbugs 1m 36s hadoop-common-project/hadoop-common in trunk has 19 extant Findbugs warnings.
          +1 javadoc 1m 46s trunk passed
          0 mvndep 0m 15s Maven dependency ordering for patch
          +1 mvninstall 1m 44s the patch passed
          +1 compile 14m 40s the patch passed
          +1 javac 14m 40s the patch passed
          -0 checkstyle 2m 3s root: The patch generated 2 new + 307 unchanged - 1 fixed = 309 total (was 308)
          +1 mvnsite 2m 3s the patch passed
          +1 mvneclipse 0m 41s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 3m 29s the patch passed
          +1 javadoc 1m 38s the patch passed
          +1 unit 7m 48s hadoop-common in the patch passed.
          -1 unit 72m 11s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 36s The patch does not generate ASF License warnings.
          151m 56s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
            hadoop.hdfs.server.namenode.ha.TestPipelinesFailover



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue HDFS-5042
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12869396/HDFS-5042-02.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 2a583979c386 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / d0f346a
          Default Java 1.8.0_131
          findbugs v3.1.0-RC1
          findbugs https://builds.apache.org/job/PreCommit-HDFS-Build/19554/artifact/patchprocess/branch-findbugs-hadoop-common-project_hadoop-common-warnings.html
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/19554/artifact/patchprocess/diff-checkstyle-root.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/19554/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/19554/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19554/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          vinayrpet Vinayakumar B added a comment -

          updated patch

          vinayrpet Vinayakumar B added a comment -

          The mentioned directory sync will be called on block close() if sync_on_close is configured.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 22s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 1m 38s Maven dependency ordering for branch
          +1 mvninstall 14m 20s trunk passed
          +1 compile 14m 25s trunk passed
          +1 checkstyle 1m 57s trunk passed
          +1 mvnsite 2m 4s trunk passed
          +1 mvneclipse 0m 41s trunk passed
          -1 findbugs 1m 22s hadoop-common-project/hadoop-common in trunk has 19 extant Findbugs warnings.
          +1 javadoc 1m 36s trunk passed
          0 mvndep 0m 14s Maven dependency ordering for patch
          +1 mvninstall 1m 26s the patch passed
          +1 compile 12m 32s the patch passed
          +1 javac 12m 32s the patch passed
          +1 checkstyle 1m 57s root: The patch generated 0 new + 307 unchanged - 1 fixed = 307 total (was 308)
          +1 mvnsite 2m 3s the patch passed
          +1 mvneclipse 0m 41s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 3m 28s the patch passed
          +1 javadoc 1m 36s the patch passed
          +1 unit 7m 41s hadoop-common in the patch passed.
          -1 unit 70m 39s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 39s The patch does not generate ASF License warnings.
          144m 29s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
            hadoop.hdfs.TestDFSStripedOutputStreamWithFailure090
            hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080
            hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure
            hadoop.hdfs.TestDFSRSDefault10x4StripedOutputStreamWithFailure



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue HDFS-5042
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12869501/HDFS-5042-03.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux a63fbf0bd5e5 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 52661e0
          Default Java 1.8.0_131
          findbugs v3.1.0-RC1
          findbugs https://builds.apache.org/job/PreCommit-HDFS-Build/19565/artifact/patchprocess/branch-findbugs-hadoop-common-project_hadoop-common-warnings.html
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/19565/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/19565/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19565/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          vinayrpet Vinayakumar B added a comment -

          Hi Kihwal Lee/Andrew Wang
          Do the changes look good to you?

          kihwal Kihwal Lee added a comment -

          fsync(File fileToSync) will leak file descriptors and other associated objects in the current code. Can we cache some directory FileChannels? Caching source dirs might be simple. We just make sure they get added/removed when volumes get added/removed online. But then I am not sure how much impact it will have. For the majority of use cases, those directory inodes will stay cached in the kernel, so it is less likely to cause any seeks. It might matter for dense datanodes with high memory pressure.

          vinayrpet Vinayakumar B added a comment -

          fsync(File fileToSync) will leak file descriptors and other associated objects in the current code.

          I am not sure what makes you think so. The channel is opened in a try-with-resources block, so it should get closed automatically.

              try (FileChannel channel = FileChannel.open(fileToSync.toPath(),
                  isDir ? StandardOpenOption.READ : StandardOpenOption.WRITE)) {
                fsync(channel, isDir);
              }

          Am I missing something here?

          Can we cache some directory FileChannels? Caching source dirs might be simple. We just make sure they get added/removed when volumes get added/removed online. But then I am not sure how much impact it will have. For the majority of use cases, those directory inodes will stay cached in the kernel, so it is less likely to cause any seeks. It might matter for dense datanodes with high memory pressure.

          Caching seems to be a good improvement for the source directory, which will be either tmp or rbw. Caching the destination is not a good idea.
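          The try-with-resources pattern quoted above can be made into a minimal self-contained sketch. Class and method names here are illustrative, not Hadoop's actual code; the shape mirrors the helper under discussion: directories are opened READ-only, regular files WRITE, and the channel is always closed, which is why no descriptor leaks.

          ```java
          import java.io.IOException;
          import java.nio.channels.FileChannel;
          import java.nio.file.Path;
          import java.nio.file.StandardOpenOption;

          /**
           * Illustrative sketch of an fsync helper (names hypothetical).
           * try-with-resources releases the descriptor even if force() throws.
           */
          public class DirSync {
            public static void fsync(Path fileToSync, boolean isDir) throws IOException {
              try (FileChannel channel = FileChannel.open(fileToSync,
                  isDir ? StandardOpenOption.READ : StandardOpenOption.WRITE)) {
                // force(true) flushes file data and metadata to the storage device.
                channel.force(true);
              }
            }
          }
          ```

          Note that opening a FileChannel on a directory works on Linux/POSIX but may throw IOException on some platforms (e.g. Windows), so production code typically tolerates that failure for the directory case.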

          kihwal Kihwal Lee added a comment -

          Am I missing something here?

          Sorry, did not look closely.

          kihwal Kihwal Lee added a comment -

          Does it make sense to have FileIoProvider#sync() call the new IOUtils.fsync()?

          vinayrpet Vinayakumar B added a comment -

          Does it make sense to have FileIoProvider#sync() call the new IOUtils.fsync()?

          Yes, I will update the patch.

          vinayrpet Vinayakumar B added a comment -

          Updated the patch

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 20s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 0m 18s Maven dependency ordering for branch
          +1 mvninstall 12m 28s trunk passed
          +1 compile 12m 44s trunk passed
          +1 checkstyle 1m 50s trunk passed
          +1 mvnsite 1m 54s trunk passed
          +1 mvneclipse 0m 33s trunk passed
          -1 findbugs 1m 23s hadoop-common-project/hadoop-common in trunk has 19 extant Findbugs warnings.
          +1 javadoc 1m 29s trunk passed
          0 mvndep 0m 14s Maven dependency ordering for patch
          +1 mvninstall 1m 21s the patch passed
          +1 compile 12m 30s the patch passed
          +1 javac 12m 30s the patch passed
          +1 checkstyle 1m 41s root: The patch generated 0 new + 307 unchanged - 1 fixed = 307 total (was 308)
          +1 mvnsite 1m 50s the patch passed
          +1 mvneclipse 0m 34s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 3m 27s the patch passed
          +1 javadoc 1m 29s the patch passed
          +1 unit 7m 42s hadoop-common in the patch passed.
          -1 unit 65m 13s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 34s The patch does not generate ASF License warnings.
          132m 34s



          Reason Tests
          Failed junit tests hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160
            hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
            hadoop.hdfs.web.TestWebHdfsTimeouts
            hadoop.hdfs.TestDFSStripedInputStreamWithRandomECPolicy
            hadoop.hdfs.server.balancer.TestBalancer
            hadoop.hdfs.TestDFSStripedOutputStreamWithFailure150



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue HDFS-5042
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12869701/HDFS-5042-04.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux a31fc119ab69 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 1c8dd6d
          Default Java 1.8.0_131
          findbugs v3.1.0-RC1
          findbugs https://builds.apache.org/job/PreCommit-HDFS-Build/19595/artifact/patchprocess/branch-findbugs-hadoop-common-project_hadoop-common-warnings.html
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/19595/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/19595/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19595/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          kihwal Kihwal Lee added a comment -

          +1 the latest patch looks good.

          kihwal Kihwal Lee added a comment -

          +1 the latest patch looks good.

          Hold on. TestDataNodeHotSwapVolumes is failing with the patch.

          vinayrpet Vinayakumar B added a comment -

          Fixed the test failure.

          vinayrpet Vinayakumar B added a comment -

          Attaching branch-2 patch

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          -1 patch 0m 9s HDFS-5042 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



          Subsystem Report/Notes
          JIRA Issue HDFS-5042
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12870244/HDFS-5042-05-branch-2.patch
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19661/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          vinayrpet Vinayakumar B added a comment -

          Re-attaching the branch-2 patch with name changed.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          -1 docker 0m 30s Docker failed to build yetus/hadoop:8515d35.



          Subsystem Report/Notes
          JIRA Issue HDFS-5042
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12870246/HDFS-5042-branch-2-05.patch
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19662/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 37s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 1m 47s Maven dependency ordering for branch
          +1 mvninstall 15m 13s trunk passed
          +1 compile 14m 32s trunk passed
          +1 checkstyle 2m 9s trunk passed
          +1 mvnsite 2m 9s trunk passed
          +1 mvneclipse 0m 42s trunk passed
          -1 findbugs 1m 39s hadoop-common-project/hadoop-common in trunk has 19 extant Findbugs warnings.
          +1 javadoc 1m 41s trunk passed
          0 mvndep 0m 16s Maven dependency ordering for patch
          +1 mvninstall 1m 49s the patch passed
          +1 compile 15m 2s the patch passed
          +1 javac 15m 2s the patch passed
          -0 checkstyle 2m 3s root: The patch generated 1 new + 307 unchanged - 1 fixed = 308 total (was 308)
          +1 mvnsite 2m 28s the patch passed
          +1 mvneclipse 0m 39s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 3m 59s the patch passed
          +1 javadoc 1m 42s the patch passed
          +1 unit 8m 18s hadoop-common in the patch passed.
          -1 unit 92m 52s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 44s The patch does not generate ASF License warnings.
          173m 37s



          Reason Tests
          Failed junit tests hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080
            hadoop.hdfs.server.namenode.TestNameNodeMXBean
            hadoop.hdfs.TestDFSStripedOutputStreamWithFailureWithRandomECPolicy



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue HDFS-5042
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12870243/HDFS-5042-05.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 647ecd3ddd63 3.13.0-108-generic #155-Ubuntu SMP Wed Jan 11 16:58:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 31058b2
          Default Java 1.8.0_131
          findbugs v3.1.0-RC1
          findbugs https://builds.apache.org/job/PreCommit-HDFS-Build/19660/artifact/patchprocess/branch-findbugs-hadoop-common-project_hadoop-common-warnings.html
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/19660/artifact/patchprocess/diff-checkstyle-root.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/19660/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/19660/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19660/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 37s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 5 new or modified test files. 0 mvndep 1m 47s Maven dependency ordering for branch +1 mvninstall 15m 13s trunk passed +1 compile 14m 32s trunk passed +1 checkstyle 2m 9s trunk passed +1 mvnsite 2m 9s trunk passed +1 mvneclipse 0m 42s trunk passed -1 findbugs 1m 39s hadoop-common-project/hadoop-common in trunk has 19 extant Findbugs warnings. +1 javadoc 1m 41s trunk passed 0 mvndep 0m 16s Maven dependency ordering for patch +1 mvninstall 1m 49s the patch passed +1 compile 15m 2s the patch passed +1 javac 15m 2s the patch passed -0 checkstyle 2m 3s root: The patch generated 1 new + 307 unchanged - 1 fixed = 308 total (was 308) +1 mvnsite 2m 28s the patch passed +1 mvneclipse 0m 39s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 3m 59s the patch passed +1 javadoc 1m 42s the patch passed +1 unit 8m 18s hadoop-common in the patch passed. -1 unit 92m 52s hadoop-hdfs in the patch failed. +1 asflicense 0m 44s The patch does not generate ASF License warnings. 
173m 37s Reason Tests Failed junit tests hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080   hadoop.hdfs.server.namenode.TestNameNodeMXBean   hadoop.hdfs.TestDFSStripedOutputStreamWithFailureWithRandomECPolicy Subsystem Report/Notes Docker Image:yetus/hadoop:14b5c93 JIRA Issue HDFS-5042 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12870243/HDFS-5042-05.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 647ecd3ddd63 3.13.0-108-generic #155-Ubuntu SMP Wed Jan 11 16:58:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 31058b2 Default Java 1.8.0_131 findbugs v3.1.0-RC1 findbugs https://builds.apache.org/job/PreCommit-HDFS-Build/19660/artifact/patchprocess/branch-findbugs-hadoop-common-project_hadoop-common-warnings.html checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/19660/artifact/patchprocess/diff-checkstyle-root.txt unit https://builds.apache.org/job/PreCommit-HDFS-Build/19660/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/19660/testReport/ modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: . Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19660/console Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
          vinayrpet Vinayakumar B added a comment -

          There seems to be some problem with Yetus for branch-2.

          kanaka Kanaka Kumar Avvaru added a comment -

          Thanks for the patch, Vinayakumar B. I think we need to ensure the directory sync on hsync() as well, since client apps may assume the data is flushed to disk. What is your view?

          vinayrpet Vinayakumar B added a comment -

          I think we need to ensure the directory sync on hsync() as well, since client apps may assume the data is flushed to disk. What is your view?

          I think it's a good point.
          I have been trying to verify this issue.
          I found that small blocks created and closed before the power failure existed nowhere on disk, neither in rbw nor in finalized, maybe because when the block files were created in rbw, those directory entries also failed to sync to the device.
          Maybe the first hsync() request on a block file can call fsync on its parent (rbw) directory.
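          The parent-directory sync discussed here can be sketched as follows. This is a minimal, Linux-oriented illustration, not Hadoop's actual IOUtils code; the class and method names (DirSync, fsyncDirectory) are hypothetical. On Linux a directory can be opened read-only and force()d; some platforms reject this, which is why the IOException is tolerated for directories.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirSync {
    /**
     * Fsync a directory so that recently created or renamed entries in it
     * survive a power failure. Without this, a file created (or renamed) in
     * the directory can vanish after a crash even though the file's own
     * data was synced.
     */
    static void fsyncDirectory(Path dir) throws IOException {
        try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
            ch.force(true); // flush the directory's metadata (its entries) to disk
        } catch (IOException e) {
            // Not all file systems / operating systems allow fsync on a directory.
            System.err.println("fsync on directory failed: " + e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Create a file, then sync its parent directory so the new entry is durable.
        Path dir = Files.createTempDirectory("rbw-demo");
        Files.createFile(dir.resolve("blk_0000"));
        fsyncDirectory(dir);
        System.out.println("created and synced " + dir.resolve("blk_0000"));
    }
}
```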

          kihwal Kihwal Lee added a comment -

          +1 on syncing rbw on the first hsync(). But let's focus on the completed files in this jira and implement the extra safety in a separate one. Let's get the branch-2 patch buttoned up.

          vinayrpet Vinayakumar B added a comment -

          +1 on syncing rbw on the first hsync(). But let's focus on the completed files in this jira and implement the extra safety in a separate one. Let's get the branch-2 patch buttoned up.

          Thanks Kihwal Lee.
          The branch-2 patch is already attached, but Jenkins is refusing to pick it up. There seems to be some problem building the Docker image.

          vinayrpet Vinayakumar B added a comment -

          For the trunk patch, the checkstyle warning can be ignored, as it is in line with the previous indentation. The test failures are unrelated.

          kihwal Kihwal Lee added a comment -

          +1, the trunk patch looks good. The branch-2 patch also looks fine.

          kihwal Kihwal Lee added a comment -

          I've committed this to trunk and branch-2. branch-2.8 does not have a separate FileIoProvider, so it will need a different patch. I am resolving this for now.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11805 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11805/)
          HDFS-5042. Completed files lost after power failure. Contributed by (kihwal: rev 1543d0f5be6a02ad00e7a33e35d78af8516043e3)

          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeHotSwapVolumes.java
          • (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/IOUtils.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/extdataset/ExternalDatasetImpl.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/FileIoProvider.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/FsDatasetSpi.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/LocalReplica.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/SimulatedFSDataset.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestSimulatedFSDataset.java
          vinayrpet Vinayakumar B added a comment -

          Attaching the branch-2.8 patch.

          vinayrpet Vinayakumar B added a comment -

          Attaching branch-2.7 patch as well.

          kihwal Kihwal Lee added a comment -

          In the 2.8 patch,

          +   * @param fileToSync the file to fsync
          +   * @param isDir if true, the given file is a directory (we open for read and
          +   *          ignore IOExceptions, because not all file systems and operating
          +   *          systems allow to fsync on a directory)
          +   */
          +  public static void fsync(File fileToSync) throws IOException {
          

          isDir is not actually a parameter for the method.

          TestDataNodeHotSwapVolumes fails just as it did with the previous version of the trunk patch.

          vinayrpet Vinayakumar B added a comment -

          Oops, that's the wrong patch I attached.
          Here is the updated one for branch-2.8.

          vinayrpet Vinayakumar B added a comment -

          Updated branch-2.7 patch as well.

          kihwal Kihwal Lee added a comment -

          The patches look good. Now the fix is in branch-2.8 and branch-2.7. Thanks for fixing this, Vinay.

          vinayrpet Vinayakumar B added a comment -

          Thanks a lot Kihwal Lee for reviews and commit.
          Thanks everyone for the discussion and pushing this long pending issue to closure.

          davelatham Dave Latham added a comment -

          +1, hear hear!

          Thanks Vinayakumar B for driving this in after 4 years, and Andrew Wang for the pointer to the possible solution.

          kihwal Kihwal Lee added a comment -

          We are seeing significant performance degradation in 2.8 with this change. Whenever the write load increases, multiple datanodes stop heartbeating for long periods, causing missing blocks. All other Xceiver threads and the actor threads end up waiting for the dataset impl lock. The writes make progress, but slowly enough that they are always caught by jstack.

          "DataXceiver for client  at xxx [Receiving block BP-aaa:blk_xxxx_xxxx]"
           #343116 daemon prio=5 os_prio=0 tid=0x00007f3b1ef18000 nid=0x19193 runnable [0x00007f3a5c104000]
             java.lang.Thread.State: RUNNABLE
                  at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
                  at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
                  at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:388)
                  at org.apache.hadoop.io.IOUtils.fsync(IOUtils.java:394)
                  at org.apache.hadoop.io.IOUtils.fsync(IOUtils.java:376)
                  at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.fsyncDirectory(FsDatasetImpl.java:899)
                  at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeReplica(FsDatasetImpl.java:1756)
                  at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1724)
                  at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:949)
                  at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:854)
                  at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
                  at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
                  at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
                  at java.lang.Thread.run(Thread.java:745)
          
             Locked ownable synchronizers:
                  - <0x00000000d5ff46f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
          

          We are still analyzing whether there are other factors in play.

          vinayrpet Vinayakumar B added a comment -

          Would doing the fsync() outside the dataset impl lock help?

          vinayrpet Vinayakumar B added a comment -

          Here is an addendum patch to move the fsync() out of the lock.
          Kihwal Lee, if this looks fine to you, I can create another Jira to push it if required.
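          The addendum's approach, rename under the lock but fsync only after releasing it, can be sketched roughly like this. The names (FinalizeSketch, finalizeReplica, datasetLock) are hypothetical stand-ins, not the actual FsDatasetImpl code:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.locks.ReentrantLock;

public class FinalizeSketch {
    private final ReentrantLock datasetLock = new ReentrantLock();

    /**
     * Move a replica from rbw to finalized under the dataset lock, then run
     * the slow directory fsync after releasing the lock, so other Xceiver
     * threads are not blocked while this thread waits on the disk. The ack
     * to the client would still happen only after the fsync returns.
     */
    Path finalizeReplica(Path rbwFile, Path finalizedDir) throws IOException {
        Path dst;
        datasetLock.lock();
        try {
            dst = finalizedDir.resolve(rbwFile.getFileName());
            Files.move(rbwFile, dst, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            datasetLock.unlock(); // release before the expensive fsync
        }
        try (FileChannel ch = FileChannel.open(finalizedDir, StandardOpenOption.READ)) {
            ch.force(true); // make the rename durable before acking the client
        }
        return dst;
    }

    public static void main(String[] args) throws IOException {
        Path rbw = Files.createTempDirectory("rbw");
        Path fin = Files.createTempDirectory("finalized");
        Path blk = Files.createFile(rbw.resolve("blk_1"));
        System.out.println("finalized: " + new FinalizeSketch().finalizeReplica(blk, fin));
    }
}
```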

          nroberts Nathan Roberts added a comment -

          Wondering if we should make this feature configurable. There are some filesystems (like ext4) where these fsyncs affect much more than the datanode process. If YARN is using the same disks and is writing significant amounts of intermediate data or performing other disk-heavy operations, the entire system will see significantly degraded performance (like disks at 100% for tens of minutes).

          kihwal Kihwal Lee added a comment -

          Here is an addendum patch to move the fsync() out of the lock.

          It should definitely help. I think the only requirement is that the fsync be done before acking back to the client.

          vinayrpet Vinayakumar B added a comment -

          Wondering if we should make this feature configurable

          This is already controlled by the existing configuration dfs.datanode.synconclose; I felt that was sufficient.
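          For reference, the switch mentioned here is set in hdfs-site.xml; it defaults to false because of the write-performance cost:

```xml
<!-- hdfs-site.xml: opt in to syncing block data (and, with this change,
     the containing directory) when a block is closed. -->
<property>
  <name>dfs.datanode.synconclose</name>
  <value>true</value>
</property>
```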

          vinayrpet Vinayakumar B added a comment -

          Created HDFS-12157 for the fsyncDir outside lock.


            People

            • Assignee: vinayrpet Vinayakumar B
            • Reporter: davelatham Dave Latham
            • Votes: 0
            • Watchers: 31