  Hadoop HDFS / HDFS-7060

Avoid taking locks when sending heartbeats from the DataNode

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      We're seeing that the heartbeat is blocked on the FsDatasetImpl monitor when the DN is under a heavy write load:

         java.lang.Thread.State: BLOCKED (on object monitor)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
              - waiting to lock <0x0000000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
              - locked <0x0000000780612fd8> (a java.lang.Object)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
              at java.lang.Thread.run(Thread.java:744)
      
         java.lang.Thread.State: BLOCKED (on object monitor)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
              - waiting to lock <0x0000000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
              at java.lang.Thread.run(Thread.java:744)
      
         java.lang.Thread.State: RUNNABLE
              at java.io.UnixFileSystem.createFileExclusively(Native Method)
              at java.io.File.createNewFile(File.java:1006)
              at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
              - locked <0x0000000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
              at java.lang.Thread.run(Thread.java:744)
      
      Attachments

      1. HDFS-7060-002.patch
        4 kB
        Brahma Reddy Battula
      2. HDFS-7060.001.patch
        5 kB
        Xinwei Qin
      3. HDFS-7060.000.patch
        4 kB
        Haohui Mai

        Issue Links

          Activity

          jingzhao Jing Zhao added a comment -

          Thanks for working on this, Haohui Mai! Could you please generally describe why we can remove the locks here?

          wheat9 Haohui Mai added a comment -

          This patch removes two types of synchronization:

          -    synchronized(dataset) {
          

          As bpSlices itself is a ConcurrentHashMap, there should be no need to protect it with a lock. The behavior, however, is slightly different, since FsVolumeList#addBlockPool may add block pools while the value of dfsUsed() is being calculated.

          -  // Used for synchronizing access to usage stats
          -  private final Object statsLock = new Object();
          -
          

          The statsLock seems to be protecting access to the FsVolumeList. Since FsVolumeList is immutable, it looks unnecessary.
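          For context, here is a minimal sketch of the lockless read this relies on, using hypothetical names rather than the actual FsVolumeImpl fields: iteration over a ConcurrentHashMap is weakly consistent, so it never throws, but a pool added concurrently by addBlockPool may or may not be counted (the behavioral difference noted above).

            import java.util.concurrent.ConcurrentHashMap;

            // Simplified stand-in for a volume's per-block-pool usage map (not the real HDFS code).
            class VolumeUsageSketch {
              private final ConcurrentHashMap<String, Long> bpUsage = new ConcurrentHashMap<>();

              // Read path: no synchronized(dataset) needed; iteration is weakly consistent.
              long getDfsUsed() {
                long total = 0;
                for (long used : bpUsage.values()) {
                  total += used;
                }
                return total;
              }

              // Write path: safe to run concurrently with getDfsUsed().
              void addBlockPool(String bpid, long used) {
                bpUsage.put(bpid, used);
              }
            }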

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12668468/HDFS-7060.000.patch
          against trunk revision 54e5794.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract
          org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8009//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8009//console

          This message is automatically generated.

          xinwei Xinwei Qin added a comment -

          Hi, Haohui Mai,
          This patch is interesting. What's the recent progress, and what's your latest thinking on it?

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12668468/HDFS-7060.000.patch
          against trunk revision ff83ae7.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9851//console

          This message is automatically generated.

          wheat9 Haohui Mai added a comment -

          I don't have time to drive it forward right now. Please feel free to assign to yourself if you're interested in working on it.

          xinwei Xinwei Qin added a comment -

          OK, I'd be glad to take this over, and I will submit a new patch as soon as possible.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12704754/HDFS-7060.001.patch
          against trunk revision 3ff1ba2.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.snapshot.TestUpdatePipelineWithSnapshots

          The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9896//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9896//console

          This message is automatically generated.

          xinwei Xinwei Qin added a comment -

          Hi Haohui Mai,
          The new patch is available; please take a look when you have time.

          brahmareddy Brahma Reddy Battula added a comment -

          Xinwei Qin, thanks for working on this issue. The patch overall looks good to me. +1 (non-binding)

          Some minor nits: the indents are off (2 spaces vs. 2 tabs). Attaching a patch for the same.

          Haohui Mai, can you please comment?
          I have seen problems (DNs going into the dead state) on a big cluster where each DN has 90000000+ blocks and 48 data dirs configured.

          Please correct me if I am wrong. Marking this as critical; change it back if you feel it is not required.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12708051/HDFS-7060-002.patch
          against trunk revision 3d9132d.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/10102//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10102//console

          This message is automatically generated.

          sriksun Srikanth Sundarrajan added a comment -

          Avoiding locks in FsDatasetSpi::getStorageReports() seems like a reasonable approach and would unblock the DN heartbeat.

          xinwei Xinwei Qin added a comment -

          Brahma Reddy Battula Thanks for your review.

          some minor nits: the indents are off (2 spaces vs. 2 tabs)

          The indent is just 2 spaces in the Hadoop code format, so I think the 002.patch format has no problem.
          If not, please correct me.

          brahmareddy Brahma Reddy Battula added a comment -

          Yes, the 002.patch was uploaded by me.

          xinwei Xinwei Qin added a comment -

          Sorry, my mistake. I meant the 2nd patch, which should be 001.patch.

          xinwei Xinwei Qin added a comment -

          2nd patch

          cmccabe Colin P. McCabe added a comment - edited

          While I agree that making the DataNode heartbeat lockless might be a good idea in theory, it is not a bugfix. It is an optimization or improvement.

          In contrast, the issue with FsDatasetImpl#createTemporary holding the lock for a long period is a bug, and it needs to be fixed. Even if the heartbeat is made lockless, HDFS-7999 will still cause many problems such as extremely high write and read latency, unresponsive datanodes, and so on. Making the heartbeat lockless does not fix HDFS-7999, although it might alleviate a few of the symptoms.

          Edit: I see that Haohui explained why we might be able to remove some of these locks. I'll try to take a look later.

          cmccabe Colin P. McCabe added a comment -

          [~Xinwei Qin], Brahma Reddy Battula, do you have information on how much performance improvement this patch gives, assuming that HDFS-7999 has already been applied? I think this is a good improvement to do in any case, but I'm curious how important it is to get this patch in soon. For example, should we try to get it into 2.7? Having some performance numbers would help make that call. Thanks

          wheat9 Haohui Mai added a comment -

          I agree with Colin that it needs to be well thought out to make sure that we don't introduce new race conditions.

          I think we need clear explanations (possibly some refactoring) before we can put it into branch-2.

          brahmareddy Brahma Reddy Battula added a comment -

          Thanks a lot for taking a look at this issue. Sure, we'll update with performance numbers.

          xinwei Xinwei Qin added a comment -

          Colin P. McCabe, Haohui Mai, thanks a lot for your comments.

          We created 400 threads to write 20 million files to a Hadoop cluster with 3 DNs. Each DN already has about 90 million blocks.
          Without the HDFS-7060 patch, the heartbeat is consistently delayed by several hundred seconds or more, enough for the DN to be marked dead.
          With the HDFS-7060 patch, the heartbeat is only occasionally delayed by about 50s, and the DN is never marked dead.

          As seen above, making the DN heartbeat lockless reduces the heartbeat delay significantly, especially in large-scale concurrent read and write scenarios. So if the lock is unnecessary, removing it is a good idea.

          brahmareddy Brahma Reddy Battula added a comment -

          Colin P. McCabe and Haohui Mai, can you please take another look? Hopefully this can get into hadoop-2.7.

          jnp Jitendra Nath Pandey added a comment -

          We create 400 threads to write 20 million files to a Hadoop cluster with 3 DNs. Every DN has already about 90 million blocks.

          Xinwei Qin, this is a good test. One question: did you test before or after applying the HDFS-7999 patch?

          xinwei Xinwei Qin added a comment -

          Jitendra Nath Pandey Thanks for your comment. This test was done before applying HDFS-7999.

          nroberts Nathan Roberts added a comment -

          Has anyone given any further thought to this patch? It seems safe to me, and it eliminates serious stability issues when datanodes' disks get very busy. getUsed() returns a value that is calculated by a DU process that ran for a long time anyway (so it's always somewhat out of sync). It's not very difficult to load up a datanode with disk-intensive tasks that prevent it from getting a heartbeat in for several minutes, eventually being declared dead by the NN. We've seen this take out entire clusters with large Map/Reduce merges, as well as very large shuffles.

          wheat9 Haohui Mai added a comment -

          Hi Nathan Roberts, we're being conservative as you never know whether it will lead to new race conditions. Can you please test it in your cluster?

          cmccabe Colin P. McCabe added a comment -

          I took another look at this, and it seems reasonable. Every time I think I've found a potential race condition issue, it turns out that the variable being read is volatile or atomic.

          I agree that if we get a little more testing, this would be a good improvement to make.

          kihwal Kihwal Lee added a comment - edited

          We have recently tried it on a big cluster. I can't tell how much it helps, but when the nodes were hit with a heavy I/O storm, some of the datanodes slowed down enough to be "dead" for minutes. I am guessing it could have been worse without this change. Maybe it will be better with the lifeline protocol?

          szetszwo Tsz Wo Nicholas Sze added a comment -

          Without holding the dataset lock, we may hit ConcurrentModificationException since the bpSlices map could be modified by reconfiguring the volumes at runtime. We may use a different lock to protect bpSlices. ConcurrentHashMap is not enough (or not very useful) since it does not synchronize iterations.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          > ... Maybe it will be better with the lifeline protocol?

          I agree that we need HDFS-9239: datanode lifeline protocol.

          jzhuge John Zhuge added a comment -

          If we assume the dataset lock protects bpSlices, the following functions access bpSlices, but neither they nor their callers lock the dataset. Did I miss anything? Is this a bug?

          FsVolumeImpl#addBlockPool   (this is a write access to bpSlices)
            FsVolumeImpl#addBlockPool
              FsDatasetImpl#addVolume
          
          FsVolumeImpl#getBlockPoolUsed
            FsVolumeList#getBlockPoolUsed
              FsDatasetImpl#getBlockPoolUsed
          
          FsVolumeImpl#getBlockPoolSlice
            (too many callers, hard to track down !)
          
          FsVolumeImpl#getBlockPoolList
            DirectoryScanner.ReportCompiler#call
          
          FsVolumeImpl#checkDirs
            FsVolumeImpl#checkDirs
              FsDatasetImpl#checkDataDir
          
          ...
          

          A few suggestions regarding lock usage:

          • All locks including intrinsic locks should include docs that clearly state what resources they protect.
          • Locks should be used in the classes where they are defined. It is very hard to reason about them if their usages are scattered around.

          If we want to protect bpSlices with a lock, can we just use the bpSlices intrinsic lock instead of the dataset lock?
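          For illustration, here is a minimal sketch of that suggestion, with hypothetical names (this is not the actual FsVolumeImpl code): the map's own monitor guards both mutation and iteration, so readers no longer contend for the dataset lock and iteration cannot hit ConcurrentModificationException.

            import java.util.HashMap;
            import java.util.Map;

            // Simplified stand-in for a volume and its bpSlices map (illustrative only).
            class VolumeSketch {
              private final Map<String, Long> bpSlices = new HashMap<>();

              // Mutation guarded by the map's intrinsic lock instead of the dataset lock.
              void addBlockPool(String bpid, long used) {
                synchronized (bpSlices) {
                  bpSlices.put(bpid, used);
                }
              }

              // Iteration guarded by the same monitor, so it is safe against concurrent mutation.
              long getDfsUsed() {
                long total = 0;
                synchronized (bpSlices) {
                  for (long used : bpSlices.values()) {
                    total += used;
                  }
                }
                return total;
              }
            }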

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          -1 patch 0m 5s HDFS-7060 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



          Subsystem Report/Notes
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12708051/HDFS-7060-002.patch
          JIRA Issue HDFS-7060
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/14976/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          jojochuang Wei-Chiu Chuang added a comment -

          I am trying to revive this thread.
          The heartbeat issue with the dataset lock will only get more severe in the future, given that datanodes are becoming denser and Hadoop 3 supports multiple NameNodes.

          It's common to see DataNodes with 12 disks, which means 12 dataset lock acquisitions per heartbeat per NameNode. Even a 1-second delay in grabbing the dataset lock, due to other concurrent activities holding it, results in a 12-second delay in the heartbeat, and more NameNodes mean even more lock contention.

          I agree that Tsz Wo Nicholas Sze's concern about ConcurrentModificationException is valid. But modifying BlockPoolSlice can be done while holding the BlockPoolSlice object lock, as John Zhuge suggested.

          getDfsUsed is essentially an approximation of disk usage, and I don't feel the accuracy gained by locking is sufficient to justify the latency in heartbeats.

          At the least, I think we can have a NoLock version of getDfsUsed, which must be called with the dataset lock already held. That way we can eliminate most of the lock contention.
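          As a rough illustration of that split (the names are hypothetical, not the actual FsDatasetImpl API), the convention would be that the locked variant wraps the lock-free one, and callers that already hold the dataset lock call the NoLock variant directly:

            import java.util.concurrent.locks.ReentrantLock;

            // Illustrative sketch only; not the real FsDatasetImpl.
            class DatasetUsageSketch {
              private final ReentrantLock datasetLock = new ReentrantLock();
              private volatile long cachedDfsUsed; // refreshed elsewhere, e.g. by a DU thread

              // Public path: acquires the dataset lock, then delegates.
              long getDfsUsed() {
                datasetLock.lock();
                try {
                  return getDfsUsedNoLock();
                } finally {
                  datasetLock.unlock();
                }
              }

              // "NoLock" path: the caller must already hold datasetLock.
              long getDfsUsedNoLock() {
                assert datasetLock.isHeldByCurrentThread();
                return cachedDfsUsed;
              }
            }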

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          -1 patch 0m 5s HDFS-7060 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



          Subsystem Report/Notes
          JIRA Issue HDFS-7060
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12708051/HDFS-7060-002.patch
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/17973/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jojochuang Wei-Chiu Chuang added a comment -

          BlockPoolSlice objects in FsVolumeImpl are stored in a ConcurrentHashMap, so ConcurrentModificationException is not possible. FsVolumeImpl objects are stored in a CopyOnWriteArrayList, so adding/removing volumes does not throw ConcurrentModificationException either.
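          A small self-contained demonstration of that property (plain Java, nothing HDFS-specific): iterating a ConcurrentHashMap while another thread adds and removes entries completes without ConcurrentModificationException; the iteration is merely weakly consistent, so it may or may not observe the concurrent updates.

            import java.util.concurrent.ConcurrentHashMap;

            public class WeaklyConsistentIterationDemo {
              public static void main(String[] args) throws InterruptedException {
                ConcurrentHashMap<Integer, Integer> map = new ConcurrentHashMap<>();
                for (int i = 0; i < 1000; i++) {
                  map.put(i, i);
                }

                // Writer thread keeps adding and removing entries during iteration.
                Thread writer = new Thread(() -> {
                  for (int i = 1000; i < 1_000_000; i++) {
                    map.put(i, i);
                    map.remove(i - 1000);
                  }
                });
                writer.start();

                // Reader iterates concurrently; a ConcurrentHashMap iterator never throws
                // ConcurrentModificationException, unlike a plain HashMap under mutation.
                long sum = 0;
                for (int round = 0; round < 100; round++) {
                  for (int v : map.values()) {
                    sum += v;
                  }
                }
                writer.join();
                System.out.println("sum = " + sum); // value depends on timing (weakly consistent)
              }
            }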

          yangjiandan Jiandan Yang added a comment -

          Xinwei Qin, Brahma Reddy Battula, Wei-Chiu Chuang: We encountered the same problem (branch-2.8.2). BPServiceActor#offerService was blocked because sendHeartBeat waited for the FSDataset lock, blockReceivedAndDeleted was delayed by about 60s, and eventually the client could not close the file and threw the exception "Unable to close file because the last blockxxx does not have enough number of replicas".

          I think HDFS-7060 can solve our problem very well. Does this patch have any problems? Why hasn't it been merged into trunk?

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          -1 patch 0m 8s HDFS-7060 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



          Subsystem Report/Notes
          JIRA Issue HDFS-7060
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12708051/HDFS-7060-002.patch
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/21760/console
          Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.


            People

            • Assignee:
              xinwei Xinwei Qin
              Reporter:
              wheat9 Haohui Mai
            • Votes:
              0 Vote for this issue
              Watchers:
              27 Start watching this issue

              Dates

              • Created:
                Updated:

                Development