
DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: datanode, namenode
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed
    • Release Note:
      This release adds a new feature called the DataNode Lifeline Protocol. If configured, then DataNodes can report that they are still alive to the NameNode via a fallback protocol, separate from the existing heartbeat messages. This can prevent the NameNode from incorrectly marking DataNodes as stale or dead in highly overloaded clusters where heartbeat processing is suffering delays. For more information, please refer to the hdfs-default.xml documentation for several new configuration properties: dfs.namenode.lifeline.rpc-address, dfs.namenode.lifeline.rpc-bind-host, dfs.datanode.lifeline.interval.seconds, dfs.namenode.lifeline.handler.ratio and dfs.namenode.lifeline.handler.count.

      Description

      This issue proposes introduction of a new feature: the DataNode Lifeline Protocol. This is an RPC protocol responsible for reporting liveness and basic health information about a DataNode to a NameNode. Compared to the existing heartbeat messages, it is lightweight and not prone to the resource contention problems that can currently harm accurate tracking of DataNode liveness. The attached design document contains more details.
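
      For illustration only, here is a minimal sketch of how the new properties listed in the release note might be set programmatically, for example when wiring up a test cluster. The host, port, and numeric values below are assumptions rather than shipped defaults, and per the release note the lifeline takes effect only if it is configured.

      import org.apache.hadoop.conf.Configuration;

      public class LifelineConfigSketch {
        public static void main(String[] args) {
          Configuration conf = new Configuration();
          // Hypothetical NameNode host and port for the dedicated lifeline RPC server.
          conf.set("dfs.namenode.lifeline.rpc-address", "nn.example.com:8050");
          conf.set("dfs.namenode.lifeline.rpc-bind-host", "0.0.0.0");
          // Illustrative values: how often the DataNode sends lifeline messages,
          // and the lifeline handler thread count expressed as a ratio.
          conf.setInt("dfs.datanode.lifeline.interval.seconds", 9);
          conf.setDouble("dfs.namenode.lifeline.handler.ratio", 0.10);
          // dfs.namenode.lifeline.handler.count could instead set an explicit count.
          System.out.println(conf.get("dfs.namenode.lifeline.rpc-address"));
        }
      }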

      1. DataNode-Lifeline-Protocol.pdf
        124 kB
        Chris Nauroth
      2. HDFS-9239.001.patch
        75 kB
        Chris Nauroth
      3. HDFS-9239.002.patch
        77 kB
        Chris Nauroth
      4. HDFS-9239.003.patch
        77 kB
        Chris Nauroth

        Issue Links

          Activity

          stevel@apache.org Steve Loughran added a comment -

          Does this actually need RPC? Or could some UDP packet be submitted, with the recipient doing an initial auth check of the sender, then queuing it to update the internal state? After all, these are intended to be one-way announcements.

          cnauroth Chris Nauroth added a comment -

          I briefly considered UDP for that very reason. However, there isn't a strong precedent for UDP in Hadoop right now, aside from a few optional components like the HDFS NFS gateway and the Ganglia and StatsD metrics sinks. I'm reluctant to add UDP delivery troubleshooting to the mix of operational concerns that administrators need to know to support core functionality.

          daryn Daryn Sharp added a comment -

          It seems like a good idea at first, but I don't think the proposal solves the stated issues:

          • This <protocol> prevents the NameNode from spuriously marking healthy DataNodes as stale or dead.
          • ... delayed DataNodes may be flagged as stale, and applications may erroneously choose to avoid accessing those nodes
          • ... DataNodes may be flagged as dead. In extreme cases, this can cause a NameNode to schedule wasteful re-replication activity.

          Let's say the NN can't service heartbeats to avoid false-staleness (stale defaults to 30s). That means it definitely can't process IBRs either. Would a lifeline to prevent the stale flag matter at this point? At this level of congestion, nearly all of the nodes are going stale. The staleness is probably the least of your worries. If nodes are marked dead from inability to keep up with heartbeats (defaults to ~10min), the cluster itself is already effectively down. Worrying about wasted replications is dubious because the NN can't issue replications if it can't process the heartbeats.

          That is not a heavy load scenario. From personal experience, it sounds like the fallout of a 120GB+ heap stop-the-world GC. The NN wakes up, and the heartbeat monitor starts marking everything dead. This sparks a replication storm, followed by an invalidation storm, which the NN recovers from... unless it goes into another full GC. The lifeline might help slow the rise of false-dead nodes. However, I recently patched the heartbeat monitor to detect long GCs and be very gracious before marking nodes dead.

          If I've misinterpreted anything, please describe the incident that prompted this approach so we can see if it would have helped.

          kihwal Kihwal Lee added a comment - edited

          It may not help much on the namenode side. Even on extremely busy clusters, I have not seen nodes missing heartbeats and being considered dead because of the contention among heartbeats, incremental block reports (IBR) and full block reports (FBR). Well before node liveness is affected by an inundation of IBRs and FBRs, the namenode performance will degrade to an unacceptable level. It is really easy to test this: create a wide job that creates a lot of small files.

          However, making it lighter on the datanode side is a good idea. We have seen many cases where nodes are declared dead because the service actor thread is delayed/blocked.

          jnp Jitendra Nath Pandey added a comment -

          .. Well before node liveness is affected by an inundation of IBRs and FBRs, the namenode performance will degrade to an unacceptable level...

          Yes, indeed. But if datanodes are marked as dead in that situation, that completely destabilizes the system. At that point, even if we kill certain offending jobs, it takes a while before the NN can come back to an acceptable service level. This proposal should help prevent node deaths once the NN is past the overload scenario.

          I think the ZKFC health check should also be separated into a different queue or port so that it is not blocked by other messages in the NN's call queue. A failover because the NN is busy is not very helpful: the other NN also gets busy, and we end up seeing active-standby flip-flops between the namenodes.

          cnauroth Chris Nauroth added a comment -

          Some prerequisite code for this was committed in the scope of HDFS-9311, so I'm linking the issues.

          cnauroth Chris Nauroth added a comment -

          I'm attaching patch v001. This implements the lifeline protocol calls as described in the design document.

          • Lifeline messages are sent in a separate thread from the existing BPServiceActor thread to avoid getting stalled on other BPServiceActor activity. The scheduling of lifeline messages can be controlled by a few new configuration properties, which are documented in hdfs-default.xml.
          • The NameNode implementation avoids the namesystem lock while processing lifeline messages.
          • TestDataNodeLifeline is a new test suite. It works by using Mockito to inject delays in heartbeat processing and then verifying that lifeline messages still keep the DataNode alive. There are no hard-coded sleep times here. Instead, a CountDownLatch coordinates the behavior of multiple threads so that the test is deterministic.
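
          For illustration only, here is a generic latch-coordination sketch in the spirit described above; it is not the actual TestDataNodeLifeline code, and all names are made up. A "heartbeat" thread is parked on one latch to simulate an injected delay, a "lifeline" thread counts down a second latch, and the check uses a bounded await instead of a hard-coded sleep.

          import java.util.concurrent.CountDownLatch;
          import java.util.concurrent.TimeUnit;

          public class LatchCoordinationSketch {
            public static void main(String[] args) throws InterruptedException {
              final CountDownLatch releaseHeartbeat = new CountDownLatch(1);
              final CountDownLatch lifelineObserved = new CountDownLatch(1);

              // Simulated heartbeat path: stalled until the test releases it.
              Thread heartbeatThread = new Thread(() -> {
                try {
                  releaseHeartbeat.await();
                } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                }
              }, "stalled-heartbeat");

              // Simulated lifeline path: signals liveness independently of the heartbeat.
              Thread lifelineThread = new Thread(lifelineObserved::countDown, "lifeline");

              heartbeatThread.start();
              lifelineThread.start();

              // Deterministic check: bounded wait for the lifeline signal, no sleeps.
              if (!lifelineObserved.await(5, TimeUnit.SECONDS)) {
                throw new AssertionError("lifeline was never observed");
              }

              releaseHeartbeat.countDown();   // let the stalled heartbeat finish
              heartbeatThread.join();
              lifelineThread.join();
            }
          }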
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 8s docker + precommit patch detected.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          +1 mvninstall 3m 44s trunk passed
          +1 compile 5m 27s trunk passed with JDK v1.8.0_60
          +1 compile 5m 7s trunk passed with JDK v1.7.0_79
          +1 checkstyle 1m 10s trunk passed
          +1 mvnsite 2m 3s trunk passed
          +1 mvneclipse 0m 35s trunk passed
          +1 findbugs 4m 29s trunk passed
          +1 javadoc 2m 44s trunk passed with JDK v1.8.0_60
          +1 javadoc 3m 46s trunk passed with JDK v1.7.0_79
          +1 mvninstall 2m 35s the patch passed
          +1 compile 5m 27s the patch passed with JDK v1.8.0_60
          +1 cc 5m 27s the patch passed
          +1 javac 5m 27s the patch passed
          +1 compile 5m 5s the patch passed with JDK v1.7.0_79
          +1 cc 5m 5s the patch passed
          +1 javac 5m 5s the patch passed
          -1 checkstyle 1m 9s Patch generated 10 new checkstyle issues in root (total was 1105, now 1106).
          +1 mvnsite 1m 55s the patch passed
          +1 mvneclipse 0m 35s the patch passed
          -1 whitespace 0m 0s The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 xml 0m 1s The patch has no ill-formed XML file.
          +1 findbugs 4m 46s the patch passed
          +1 javadoc 2m 42s the patch passed with JDK v1.8.0_60
          +1 javadoc 3m 45s the patch passed with JDK v1.7.0_79
          -1 unit 8m 42s hadoop-common in the patch failed with JDK v1.8.0_60.
          -1 unit 77m 12s hadoop-hdfs in the patch failed with JDK v1.8.0_60.
          -1 unit 8m 40s hadoop-common in the patch failed with JDK v1.7.0_79.
          -1 unit 67m 38s hadoop-hdfs in the patch failed with JDK v1.7.0_79.
          -1 asflicense 0m 25s Patch generated 56 ASF License warnings.
          221m 35s



          Reason Tests
          JDK v1.8.0_60 Failed junit tests hadoop.ipc.TestIPC
            hadoop.test.TestTimedOutTestsListener
            hadoop.hdfs.server.namenode.TestRecoverStripedBlocks
            hadoop.hdfs.security.TestDelegationTokenForProxyUser
            hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion
            hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
            hadoop.hdfs.TestDFSStripedOutputStreamWithFailure190
            hadoop.hdfs.server.namenode.ha.TestEditLogTailer
            hadoop.hdfs.server.blockmanagement.TestReplicationPolicy
          JDK v1.7.0_79 Failed junit tests hadoop.fs.TestSymlinkLocalFSFileContext
            hadoop.hdfs.TestDFSStripedOutputStreamWithFailure200
            hadoop.hdfs.server.blockmanagement.TestNodeCount
            hadoop.hdfs.TestDFSStripedOutputStreamWithFailure070
            hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes



          Subsystem Report/Notes
          Docker Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-11
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12771665/HDFS-9239.001.patch
          JIRA Issue HDFS-9239
          Optional Tests asflicense site mvnsite javac javadoc mvninstall unit xml compile findbugs checkstyle cc
          uname Linux dd09a796022b 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/patchprocess/apache-yetus-ee5baeb/precommit/personality/hadoop.sh
          git revision trunk / 6e4562b
          Default Java 1.7.0_79
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_60 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_79
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/13468/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/13468/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/13468/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common-jdk1.8.0_60.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/13468/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_60.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/13468/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common-jdk1.7.0_79.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/13468/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_79.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/13468/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common-jdk1.8.0_60.txt https://builds.apache.org/job/PreCommit-HDFS-Build/13468/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_60.txt https://builds.apache.org/job/PreCommit-HDFS-Build/13468/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common-jdk1.7.0_79.txt https://builds.apache.org/job/PreCommit-HDFS-Build/13468/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_79.txt
          JDK v1.7.0_79 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/13468/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HDFS-Build/13468/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
          Max memory used 228MB
          Powered by Apache Yetus http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/13468/console

          This message was automatically generated.

          daryn Daryn Sharp added a comment -

          There are really two problems to solve here: ensuring the DN can actually heartbeat, as Kihwal alluded to, and ensuring the NN can process it in a reasonable time.

          In the DN, our main problems with the DN jamming up and not sending heartbeats were: 1) commands (finalize) not handled async; 2) getting the du/df metrics for the heartbeat blocked because the block layout change paralyzed disks. Although finalize is now async, in the more general sense heartbeat response commands should always be decoupled from the sending of the heartbeat.

          On the NN, the fsn lock could be a problem, but in practice we've not hit it even with over 5k nodes. But I really like the approach of making the heartbeat stat updates fsn-lockless. Collecting the commands w/o the lock (since it doubles as an operational state lock) isn't trivial, or you would have done that. What you could do after the stat update is use tryLock for a short time. If you can't get the lock, oh well, this heartbeat response doesn't get any commands.

          I'm not sure we need yet another RPC server for this purpose. NN heartbeat processing with a lockless + tryLock implementation would make it ideally suited for the existing client and/or service servers.
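
          For illustration only, a minimal sketch of the tryLock idea above, using a plain java.util.concurrent lock as a stand-in for the namesystem lock; this is not FSNamesystem code, and the class, method, and return types are placeholders. Stats are updated without the lock, and commands are collected only if the lock can be taken within a short bound.

          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.locks.ReentrantReadWriteLock;

          public class TryLockHeartbeatSketch {
            private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();

            public String[] handleHeartbeat() throws InterruptedException {
              // Liveness/stat updates never block on the lock.
              updateStatsLockless();

              // Best-effort command collection: wait at most 10 ms for the lock.
              if (fsnLock.readLock().tryLock(10, TimeUnit.MILLISECONDS)) {
                try {
                  return collectCommands();
                } finally {
                  fsnLock.readLock().unlock();
                }
              }
              // Lock was busy: this heartbeat response simply carries no commands.
              return new String[0];
            }

            private void updateStatsLockless() {
              // placeholder for per-node usage and liveness updates
            }

            private String[] collectCommands() {
              // placeholder for collecting DatanodeCommand-style work items
              return new String[] {"replicate", "invalidate"};
            }
          }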

          anu Anu Engineer added a comment -

          Chris Nauroth, without alluding to anything that Daryn Sharp brought up, thanks for the patch. It looks very good; some minor comments/questions below.

          1. nit: BPOfferService constructor
          Perhaps add a Precondition to make this relationship explicit, something like Preconditions.checkState(lifelineNnAddrs.size() == nnAddrs.size()), since we access both lists with the same index?

          2. More of a question for my own understanding:
          In BPOfferServiceActor#run we retry the await operation on being interrupted. My question is: when would it be safe to retry? I wanted to understand if there are any scenarios where this can happen. By the way, I do see this as a good coding pattern.

          while (shouldRun()) {
                  try {
                    initialRegistrationComplete.await();
                    break;
                  } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                  }
                }
          

          3. nit: DatanodeManager.java - Javadoc - xmitsInProgress replaced by maxTransfers

          4. In DatanodeManager.java:

            synchronized (datanodeMap) {
                  DatanodeDescriptor nodeinfo = getDatanode(nodeReg);
          

          Two comments here:
          a. Just wondering if it makes sense to move synchronized(datanodeMap) into getDataNode.
          b. Did you intend to call heartbeatManager.updateLifeline inside the synchronized(datanodeMap), or just inside synchronized (heartbeatManager)? Do we need to keep the lock on datanodeMap while updating stats?

          5. hdfs-default.xml: nit: since we rely on the ratio as the default, you might want to fix the description, which says the default is 1.
          <property>
          <name>dfs.namenode.lifeline.handler.count</name>
          <value></value>
          <description>
          Sets number of RPC server threads the NameNode runs for handling the
          lifeline RPC server. The default value is 1, because this RPC server
          handles only HA health check requests from ZKFC. These are lightweight

          kihwal Kihwal Lee added a comment -

          NN heartbeat processing with a lockless + tryLock implementation would make it ideally suited for the existing client and/or service servers.

          NN should still enforce a max number of skips and guarantee commands are sent in bounded time. Replication or block recovery is done through an asynchronous protocol, but oftentimes clients expect them to be done "soon".

          mingma Ming Ma added a comment -

          Sorry for jumping into the discussion late. While we haven't seen any recent issues caused by DNs incorrectly marked as dead, maybe this feature could mitigate the replication storm issue, where incorrectly marked DNs cause even more replication?

          • It seems the introduction of a new RPC server is to work around the existing functionality of RPC, which only supports QoS based on user names. Imagine if the RPC server could provide differentiated service based on method names; then we could just add sendLifeline to the existing DatanodeProtocol and have the same RPC server process the method call at the highest priority. Adding method-based RPC QoS could help other use cases, for example, if we want to prioritize the existing heartbeat over IBRs.
          • Regarding the DN contention scenario that blocks it from sending sendLifeline to the NN, we could skip all the extra info such as storage reports. But if the DN is already in such a state, maybe not sending sendLifeline is what we want anyway.
          cnauroth Chris Nauroth added a comment -

          Thanks for the great reviews, everyone. I'll respond to some of the feedback now and update the patch later.

          What you could do after the stat update is use tryLock for a short time. If you can't get the lock, oh well, this heartbeat response doesn't get any commands.

          That's an interesting idea. I'll explore this.

          I'm not sure we need yet another RPC server for this purpose.

          Just to make sure everyone is aware, the additional optional RPC server is effectively already there due to committing HDFS-9311, which specifically targeted the related problem of ZKFC health check messages getting blocked. I agree that yet another RPC server is not ideal in terms of operational complexity, but I also saw it as the only viable option achievable short-term in the 2.x line.

          In BPOfferServiceActor#run we retry the await operation on being interrupted. My question is: when would it be safe to retry?

          In practice, I expect it never will retry. Suppose the thread enters await on the latch, but gets interrupted before the initial registration completes. The only thing that triggers thread interruption is shutdown of the whole DataNode (either the whole JVM process or a single DataNode inside a MiniDFSCluster). That means that by the time this thread gets interrupted, internal flags have already been updated such that the shouldRun() call on the next iteration will return false. run() would then return with no further action taken, and the thread would stop.

          However, there is also no harm done if the await gets retried. This is a daemon thread, so even if it keeps retrying, it won't block a normal JVM exit.

          Just wondering if it makes sense to move synchronized(datanodeMap) into getDataNode.

          This might be a good idea, but I'd prefer to treat it as a separate code improvement decoupled from the work here. Right now, there are multiple points in the code that depend on a specific lock ordering of heartbeatManager first followed by datanodeMap second to prevent deadlock. The current code makes this explicit with nested synchronized (X) blocks instead of implicit by declaring particular methods synchronized. Also, HDFS-8966 is making a lot of changes to the locking here, so I expect making changes now would just create merge conflicts for that effort later.

          Did you intend to call heartbeatManager.updateLifeline inside the synchronized(datanodeMap), or just inside synchronized (heartbeatManager)? Do we need to keep the lock on datanodeMap while updating stats?

          This locking was intentional. If we do not hold the lock on datanodeMap during the get+update, then there is a risk that multiple heartbeats in flight for the same DataNode could cause a lost update, or even an inconsistency where the final state of the DatanodeDescriptor really contains some stats from one heartbeat and other stats from another heartbeat. I have not observed excessive lock contention here during the issues that prompted me to file this JIRA, so I didn't try hard to optimize this locking away. Some of the work in HDFS-8966 is likely to reduce the locking here anyway.
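
          As a generic illustration of that consistency concern (names made up, not the DatanodeManager code): holding one lock across the lookup and the update keeps both stat fields coming from the same heartbeat, whereas releasing it between the two steps would let another in-flight heartbeat interleave and mix values.

          import java.util.HashMap;
          import java.util.Map;

          public class LostUpdateSketch {
            private static final Map<String, long[]> nodeStats = new HashMap<>();
            private static final Object mapLock = new Object();

            static void handleHeartbeat(String nodeId, long capacity, long used) {
              // Lookup and update happen under one lock, so both fields always
              // come from the same heartbeat.
              synchronized (mapLock) {
                long[] stats = nodeStats.computeIfAbsent(nodeId, k -> new long[2]);
                stats[0] = capacity;
                stats[1] = used;
              }
            }

            public static void main(String[] args) throws InterruptedException {
              Thread a = new Thread(() -> handleHeartbeat("dn1", 100L, 40L));
              Thread b = new Thread(() -> handleHeartbeat("dn1", 200L, 55L));
              a.start();
              b.start();
              a.join();
              b.join();
              synchronized (mapLock) {
                long[] s = nodeStats.get("dn1");
                // Always (100, 40) or (200, 55); never a mix of the two heartbeats.
                System.out.println(s[0] + " " + s[1]);
              }
            }
          }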

          NN should still enforce a max number of skips and guarantee commands are sent in bounded time. Replication or block recovery is done through an asynchronous protocol, but oftentimes clients expect them to be done "soon".

          Are you saying that beyond some skipping threshold, the heartbeat should still be considered a failure, eventually causing the DataNode to be marked stale and then dead? I'm not sure how we'd set such a threshold, given that there is no SLA defined on these operations AFAIK. I'd say there is still value in keeping a DataNode alive in these cases, such as for serving reader activity.

          It seems the introduction of a new RPC server is to work around the existing functionality of RPC, which only supports QoS based on user names.

          Yes, that's partially correct.

          I agree that introduction of another RPC server is in some sense a workaround. In fact, I would make the same argument for dfs.namenode.servicerpc-address in the first place too. Lack of sophisticated QoS drove us to isolate operations to a separate RPC server. (dfs.namenode.servicerpc-address also can have some side benefits in multi-homed environments that want to dedicate a whole separate network interface to certain operations.)

          I think a separate RPC server is a viable short-to-mid-term solution in the 2.x line. Longer term, I'd prefer that we evolve towards more sophisticated QoS that prioritizes critical "control plane" activity like heartbeats. However, this goes deeper than just QoS at the RPC layer, because we have observed contention on the namesystem lock as a contributing factor. This implies that a full solution, running on just one RPC server, requires some kind of call pre-emption capability, or maybe a cooperative multi-tasking approach with operations having the ability to yield the lock. This of course would violate a bunch of assumptions in the NameNode code, which is why I think it's infeasible in 2.x.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          -1 patch 0m 4s HDFS-9239 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



          Subsystem Report/Notes
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12771665/HDFS-9239.001.patch
          JIRA Issue HDFS-9239
          Powered by Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/14257/console

          This message was automatically generated.

          cnauroth Chris Nauroth added a comment -

          I'd like to proceed with this feature, as it has been mentioned as potentially relevant in comments on other JIRAs. I'm attaching patch v002 with just a few small changes:

          1. Rebase on current trunk.
          2. Address comments from Anu.
          3. Fix a few Checkstyle warnings. I think the remaining Checkstyle warnings flagged in the last pre-commit run are not worth addressing, but I'll review the next pre-commit run for new warnings.

          There had been a suggestion of changing the existing heartbeat handling to use tryLock. I explored this a bit, but I'm reluctant to alter mainline heartbeat processing at all. Overall, I think this feature is less intrusive as currently implemented, despite the fact that another RPC server adds some operational complexity. Perhaps a tryLock-based implementation of heartbeat handling could be done in a separate JIRA, again gated by a configuration flag, to enable further experimentation in large clusters.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          This feature is going to be very useful for busy clusters.

          Some quick comments:

          // DatanodeManager.handleLifeline
              synchronized (heartbeatManager) {
                synchronized (datanodeMap) {
                  DatanodeDescriptor nodeinfo = getDatanode(nodeReg);
                  ...
                  heartbeatManager.updateLifeline(nodeinfo, reports, cacheCapacity,
                      cacheUsed, xceiverCount, failedVolumes, volumeFailureSummary);
                }
          
          • synchronized (datanodeMap) should be synchronized (this). We no longer synchronize on datanodeMap.
          • Do we need synchronized (heartbeatManager)? heartbeatManager.updateLifeline is already synchronized.
          szetszwo Tsz Wo Nicholas Sze added a comment -
          //LifelineSender
          +    public void run() {
          +      while (shouldRun()) {
          +        try {
          +          initialRegistrationComplete.await();
          +          break;
          +        } catch (InterruptedException e) {
          +          Thread.currentThread().interrupt();
          +        }
          +      }
          

          If there is an InterruptedException, it probably should rethrow it as a RuntimeException. Calling Thread.currentThread().interrupt() just sets the thread's interrupt status, but the thread will continue running.
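
          For reference, a tiny standalone illustration of that behavior (not code from the patch): the interrupt call only sets the status, the thread keeps running, and the next blocking call observes the status and throws.

          import java.util.concurrent.CountDownLatch;

          public class InterruptStatusSketch {
            public static void main(String[] args) {
              Thread.currentThread().interrupt();
              // The thread is still running; only the status flag is set.
              System.out.println("interrupted = " + Thread.currentThread().isInterrupted());

              CountDownLatch latch = new CountDownLatch(1);
              try {
                latch.await();   // throws immediately because the status is set
              } catch (InterruptedException e) {
                // The status is cleared when the exception is thrown.
                System.out.println("await observed the interrupt");
              }
            }
          }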

          szetszwo Tsz Wo Nicholas Sze added a comment -

          Should the try-catch be restructured like below?

              @Override
              public void run() {
                try {
                  initialRegistrationComplete.await();
          
                  while (shouldRun()) {
                    try {
                      if (lifelineNamenode == null) {
                        lifelineNamenode = dn.connectToLifelineNN(lifelineNnAddr);
                      }
                      sendLifelineIfDue();
                    } catch (IOException e) {
                      LOG.warn("IOException in LifelineSender for " + BPServiceActor.this, e);
                    }
                    Thread.sleep(scheduler.getLifelineWaitTime());
                  }
                } catch (InterruptedException e) {
                  LOG.warn("LifelineSender interrupted", e);
                }
          
                LOG.info("LifelineSender for " + BPServiceActor.this + " exiting.");
              }
          
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 10s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 0m 14s Maven dependency ordering for branch
          +1 mvninstall 6m 49s trunk passed
          +1 compile 7m 36s trunk passed with JDK v1.8.0_72
          +1 compile 6m 50s trunk passed with JDK v1.7.0_95
          +1 checkstyle 1m 25s trunk passed
          +1 mvnsite 2m 1s trunk passed
          +1 mvneclipse 0m 28s trunk passed
          +1 findbugs 3m 31s trunk passed
          +1 javadoc 1m 59s trunk passed with JDK v1.8.0_72
          +1 javadoc 2m 53s trunk passed with JDK v1.7.0_95
          0 mvndep 0m 14s Maven dependency ordering for patch
          +1 mvninstall 1m 30s the patch passed
          +1 compile 7m 7s the patch passed with JDK v1.8.0_72
          +1 cc 7m 7s the patch passed
          +1 javac 7m 7s the patch passed
          +1 compile 7m 19s the patch passed with JDK v1.7.0_95
          +1 cc 7m 19s the patch passed
          +1 javac 7m 19s the patch passed
          -1 checkstyle 1m 19s root: patch generated 8 new + 1053 unchanged - 6 fixed = 1061 total (was 1059)
          +1 mvnsite 1m 56s the patch passed
          +1 mvneclipse 0m 27s the patch passed
          -1 whitespace 0m 0s The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 xml 0m 1s The patch has no ill-formed XML file.
          +1 findbugs 3m 57s the patch passed
          +1 javadoc 2m 6s the patch passed with JDK v1.8.0_72
          +1 javadoc 2m 58s the patch passed with JDK v1.7.0_95
          +1 unit 7m 45s hadoop-common in the patch passed with JDK v1.8.0_72.
          -1 unit 75m 13s hadoop-hdfs in the patch failed with JDK v1.8.0_72.
          +1 unit 8m 3s hadoop-common in the patch passed with JDK v1.7.0_95.
          -1 unit 71m 55s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
          +1 asflicense 0m 28s Patch does not generate ASF License warnings.
          227m 52s



          Reason Tests
          JDK v1.8.0_72 Failed junit tests hadoop.hdfs.server.datanode.fsdataset.impl.TestScrLazyPersistFiles
            hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot
            hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl
          JDK v1.7.0_95 Failed junit tests hadoop.hdfs.web.TestWebHdfsTimeouts
            hadoop.hdfs.server.datanode.TestDataNodeLifeline
            hadoop.hdfs.TestRollingUpgrade



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ca8df7
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12791058/HDFS-9239.002.patch
          JIRA Issue HDFS-9239
          Optional Tests asflicense mvnsite compile javac javadoc mvninstall unit xml findbugs checkstyle cc
          uname Linux 18e822dc2d0b 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 67880cc
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_72 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/14698/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/14698/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/14698/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_72.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/14698/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/14698/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_72.txt https://builds.apache.org/job/PreCommit-HDFS-Build/14698/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/14698/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/14698/console
          Powered by Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          cnauroth Chris Nauroth added a comment -

          Tsz Wo Nicholas Sze, thank you for your code review. I am attaching patch v003:

          • Fixed a few style warnings that were legitimate. I don't plan to fix the remaining style warnings.
          • Fixed a race condition in TestDataNodeLifeline that caused it to fail in the last pre-commit run. The test aggressively tunes down the heartbeat and staleness intervals so that it can quickly simulate the lifeline behavior. I didn't do the math correctly, though, so running the test in some environments could cause a DataNode to go stale before the lifeline message arrived. After correcting the staleness setting, the test now passes consistently in multiple VMs that I tried.
          • Thank you for pointing out the locking differences since HDFS-9371 got committed. I was not up to date with that patch. I have removed the synchronization from DatanodeManager#handleLifeline, which makes it consistent with the current implementation of DatanodeManager#handleHeartbeat.
          • Added comments in LifelineSender to clarify the main loop logic.

          If there is an InterruptedException, it probably should rethrow it as a RuntimeException. Calling Thread.currentThread().interrupt() just sets the thread's interrupt status, but the thread will continue running.

          This relates back to an earlier question from Anu. The only possible way for an interrupt to occur while waiting on this latch is if the state of the actor has been updated such that the next call to shouldRun() will return false. Therefore, effectively the thread does finish running. This is not obvious from reading the code though. Hopefully the comments I added are helpful.

          I don't think it would be appropriate to rethrow an unchecked exception, because the uncaught exception handler would log it as an error, even though it's not really anything to be concerned about.

          Should the try-catch be restructured like below?

          The current BPServiceActor shutdown is controlled by a sequence of:

          1. Change variables so that shouldRun() returns false.
          2. Interrupt the thread, so that it is notified to try checking shouldRun() again.

          I'd prefer that we keep that logic consistent in the new LifelineSender, so that's why I put a while (shouldRun()) loop around initialRegistration.await(). The reason why I put the rest of the logic in a second while (shouldRun()) loop is that it is pointless for every loop iteration to wait on the latch again after initial registration completes. Admittedly, this is probably a micro-optimization, and we could probably put everything in a single while (shouldRun()) loop without harming correctness if you feel strongly about it.

          I did not make any changes in LifelineSender in this revision of the patch, but I did add comments to try to help make the logic clearer. Let me know your thoughts.
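
          For illustration, here is a minimal, self-contained sketch of the loop structure described above. This is not the patch code: only the two while (shouldRun()) loops, the wait on the initial registration latch, and the interrupt handling reflect this discussion; the class name, fields, interval value, and stop() helper are purely illustrative.

          import java.io.IOException;
          import java.util.concurrent.CountDownLatch;

          // Sketch only: mirrors the two-loop structure discussed above.
          class LifelineSenderSketch implements Runnable {
            private final CountDownLatch initialRegistration = new CountDownLatch(1);
            private volatile boolean running = true;
            private final long lifelineIntervalMs = 3000; // illustrative interval

            private boolean shouldRun() {
              return running && !Thread.currentThread().isInterrupted();
            }

            void stop(Thread senderThread) {
              running = false;          // step 1: make shouldRun() return false
              senderThread.interrupt(); // step 2: wake the thread so it re-checks shouldRun()
            }

            private void sendLifeline() throws IOException {
              // In the real patch this sends the lightweight lifeline RPC to the NameNode.
            }

            @Override
            public void run() {
              // First loop: wait for initial registration. The only interrupt that can
              // arrive here is the shutdown interrupt, sent after shouldRun() has been
              // made to return false, so restoring the interrupt status and re-checking
              // the loop condition is enough to let the thread finish quietly.
              while (shouldRun()) {
                try {
                  initialRegistration.await();
                  break;
                } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                }
              }
              // Second loop: send a lifeline each interval until shutdown flips shouldRun().
              // Waiting on the latch again here would be pointless once registration is done.
              while (shouldRun()) {
                try {
                  sendLifeline();
                  Thread.sleep(lifelineIntervalMs);
                } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                } catch (IOException e) {
                  // Log and continue; a single missed lifeline is not fatal.
                }
              }
            }
          }

          Usage in this sketch would be: start a thread on the sender, count down the latch once initial registration completes, and call stop() with that thread during shutdown.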

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 25s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 0m 31s Maven dependency ordering for branch
          +1 mvninstall 8m 58s trunk passed
          +1 compile 12m 32s trunk passed with JDK v1.8.0_74
          +1 compile 10m 15s trunk passed with JDK v1.7.0_95
          +1 checkstyle 1m 41s trunk passed
          +1 mvnsite 2m 44s trunk passed
          +1 mvneclipse 0m 37s trunk passed
          +1 findbugs 4m 38s trunk passed
          +1 javadoc 3m 0s trunk passed with JDK v1.8.0_74
          +1 javadoc 3m 45s trunk passed with JDK v1.7.0_95
          0 mvndep 0m 20s Maven dependency ordering for patch
          +1 mvninstall 1m 59s the patch passed
          +1 compile 12m 50s the patch passed with JDK v1.8.0_74
          -1 cc 15m 10s root-jdk1.8.0_74 with JDK v1.8.0_74 generated 1 new + 9 unchanged - 1 fixed = 10 total (was 10)
          +1 cc 12m 50s the patch passed
          +1 javac 12m 50s the patch passed
          +1 compile 10m 11s the patch passed with JDK v1.7.0_95
          +1 cc 10m 11s the patch passed
          +1 javac 10m 11s the patch passed
          -1 checkstyle 1m 46s root: patch generated 7 new + 1053 unchanged - 6 fixed = 1060 total (was 1059)
          +1 mvnsite 2m 44s the patch passed
          +1 mvneclipse 0m 42s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 xml 0m 1s The patch has no ill-formed XML file.
          +1 findbugs 5m 33s the patch passed
          +1 javadoc 3m 1s the patch passed with JDK v1.8.0_74
          +1 javadoc 3m 53s the patch passed with JDK v1.7.0_95
          -1 unit 11m 14s hadoop-common in the patch failed with JDK v1.8.0_74.
          -1 unit 95m 57s hadoop-hdfs in the patch failed with JDK v1.8.0_74.
          +1 unit 10m 24s hadoop-common in the patch passed with JDK v1.7.0_95.
          -1 unit 79m 57s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
          +1 asflicense 0m 26s Patch does not generate ASF License warnings.
          292m 8s



          Reason Tests
          JDK v1.8.0_74 Failed junit tests hadoop.fs.shell.find.TestPrint0
            hadoop.fs.shell.find.TestIname
            hadoop.ipc.TestIPC
            hadoop.hdfs.web.TestWebHdfsTimeouts
            hadoop.hdfs.TestDFSUpgradeFromImage
            hadoop.hdfs.TestMissingBlocksAlert
            hadoop.hdfs.server.namenode.ha.TestEditLogTailer
            hadoop.hdfs.TestFileAppend
            hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure
            hadoop.hdfs.server.namenode.TestEditLog
            hadoop.hdfs.server.datanode.TestDirectoryScanner
          JDK v1.7.0_95 Failed junit tests hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints
            hadoop.hdfs.server.datanode.TestDataNodeMXBean
            hadoop.metrics2.sink.TestRollingFileSystemSinkWithSecureHdfs
            hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot
            hadoop.hdfs.server.datanode.TestDirectoryScanner



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ca8df7
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12791370/HDFS-9239.003.patch
          JIRA Issue HDFS-9239
          Optional Tests asflicense mvnsite compile javac javadoc mvninstall unit xml findbugs checkstyle cc
          uname Linux c056a21b3fb8 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / ff0ee84
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_74 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          cc root-jdk1.8.0_74: https://builds.apache.org/job/PreCommit-HDFS-Build/14714/artifact/patchprocess/diff-compile-cc-root-jdk1.8.0_74.txt
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/14714/artifact/patchprocess/diff-checkstyle-root.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/14714/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common-jdk1.8.0_74.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/14714/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_74.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/14714/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/14714/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common-jdk1.8.0_74.txt https://builds.apache.org/job/PreCommit-HDFS-Build/14714/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_74.txt https://builds.apache.org/job/PreCommit-HDFS-Build/14714/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/14714/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/14714/console
          Powered by Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          kihwal Kihwal Lee added a comment -

          Filed HDFS-9905 for TestWebHdfsTimeouts.

          cnauroth Chris Nauroth added a comment -

          The test failures are unrelated. The remaining style warnings are not worth addressing. Tsz Wo Nicholas Sze, would you please take a look at patch v003 and my comments that go with it? Thank you.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          +1 the new patch looks good. Thanks for the update.

          cnauroth Chris Nauroth added a comment -

          I have committed this to trunk, branch-2 and branch-2.8. Anu Engineer and Tsz Wo Nicholas Sze, thank you for the code reviews. Everyone, thank you for participating in the discussion.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #9426 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9426/)
          HDFS-9239. DataNode Lifeline Protocol: an alternative protocol for (cnauroth: rev 2759689d7d23001f007cb0dbe2521de90734dd5c)

          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeLifelineProtocolPB.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/metrics/DataNodeMetrics.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeRegister.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • hadoop-common-project/hadoop-common/src/site/markdown/Metrics.md
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/DatanodeLifelineProtocol.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBpServiceActorScheduler.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • hadoop-hdfs-project/hadoop-hdfs/pom.xml
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolManager.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBPOfferService.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBlockPoolManager.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeLifeline.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/NamenodeProtocols.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeLifelineProtocolServerSideTranslatorPB.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/proto/DatanodeLifelineProtocol.proto
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeLifelineProtocolClientSideTranslatorPB.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DNConf.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java
          vinayrpet Vinayakumar B added a comment -

          Even though this JIRA, along with HDFS-9311, is a new feature, IMO it would be worth merging to branch-2.7. A lot of deployments would benefit.
          What do you think, Kihwal Lee?
          If that's okay, I would be happy to post a patch for branch-2.7 if the available patches do not apply there.

          stevel@apache.org Steve Loughran added a comment -

          I'm nervous about pushing this back to older branches. At the very least, ship Hadoop 2.8 and see if things break.

          cnauroth Chris Nauroth added a comment -

          I also am hesitant to backport a patch of this size right now, but I might reconsider after getting more production experience with the feature.

          nroberts Nathan Roberts added a comment -

          However, making it lighter on the datanode side is a good idea. We have seen many cases where nodes are declared dead because the service actor thread is delayed/blocked.

          Just a quick update on this comment. Even after HDFS-7060, we still had cases where DataNodes would fail to heartbeat in. We eventually tracked this down to the RHEL CFQ I/O scheduler. There are situations where significant seek activity (like a massive shuffle) can cause this I/O scheduler to indefinitely starve writers. This eventually causes the datanode and/or nodemanager processes to completely stop (probably due to logging I/O backing up). So, no matter how smart we make these daemons, they are going to be lost from the NN/RM point of view in these situations. But this is probably the right thing to do in these cases: these daemons are clearly not able to do their job, so they SHOULD be declared lost.

          In any event, the change we found most valuable for this situation was switching to the deadline I/O scheduler. This dramatically reduced the number of lost datanodes and nodemanagers we were seeing.
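
          Purely as an illustration of where that knob lives (this is a hypothetical snippet, not anything from the patch; the scheduler is normally switched via sysfs, udev rules, or the elevator= kernel boot parameter, writing it requires root, and the device name below is a placeholder):

          import java.io.IOException;
          import java.nio.charset.StandardCharsets;
          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.nio.file.Paths;

          // Hypothetical sketch: inspect and switch the per-device I/O scheduler through sysfs.
          public class SwitchToDeadline {
            public static void main(String[] args) throws IOException {
              Path knob = Paths.get("/sys/block/sda/queue/scheduler"); // placeholder device
              // Prints something like "noop anticipatory deadline [cfq]"; brackets mark the active scheduler.
              System.out.println(new String(Files.readAllBytes(knob), StandardCharsets.UTF_8).trim());
              // Writing a scheduler name activates it for that device (root required, not persistent across reboots).
              Files.write(knob, "deadline".getBytes(StandardCharsets.UTF_8));
            }
          }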

          cnauroth Chris Nauroth added a comment -

          Nathan Roberts, fascinating! Thank you for sharing. Just to make sure I'm clear, are you talking about configuring the deadline scheduler as described here?

          https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Kernel_Boot_Parameters-The_IO_Scheduler.html

          Also, did you find that any additional scheduler tuning configuration was needed, as described here?

          https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/ch06s04s02.html

          nroberts Nathan Roberts added a comment -

          Just to make sure I'm clear, are you talking about configuring the deadline scheduler as described here?

          Yes, those links are talking about the right parameters.

          We currently run with read_expire=1000, write_expire=1000, and writes_starved=1. Since our I/O workloads change dramatically over time, we didn't spend a lot of time looking for optimal values here. These have been working well for the last several months across multiple clusters.

          As an aside, a relatively easy way to reproduce this problem is to put a heavy seek load on all the disks of a datanode (e.g. http://www.linuxinsight.com/how_fast_is_your_disk.html; I believe 5-10 copies of seeker were sufficient). After a minute or so, the system becomes almost unusable and the datanode will be declared lost. This might be a good test to run against the lifeline protocol. My hunch is that, with CFQ, the datanode will still be lost.
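
          For reference, a hypothetical snippet showing where the deadline tunables above live in sysfs once the deadline scheduler is active (the device name is a placeholder, writing requires root, and these values are normally persisted through udev rules or init scripts rather than application code):

          import java.io.IOException;
          import java.nio.charset.StandardCharsets;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          // Hypothetical sketch: apply the deadline tunables mentioned above through sysfs.
          public class ApplyDeadlineTunables {
            public static void main(String[] args) throws IOException {
              String dev = "sda"; // placeholder device name
              String base = "/sys/block/" + dev + "/queue/iosched/";
              write(base + "read_expire", "1000");  // ms before a queued read must be serviced
              write(base + "write_expire", "1000"); // ms before a queued write must be serviced
              write(base + "writes_starved", "1");  // read batches allowed before servicing writes
            }

            private static void write(String path, String value) throws IOException {
              Files.write(Paths.get(path), value.getBytes(StandardCharsets.UTF_8));
            }
          }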


            People

            • Assignee: cnauroth Chris Nauroth
            • Reporter: cnauroth Chris Nauroth
            • Votes: 0
            • Watchers: 31
