Hadoop HDFS / HDFS-9305

Delayed heartbeat processing causes storm of subsequent heartbeats

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.1
    • Fix Version/s: 2.8.0, 2.7.2, 3.0.0-alpha1
    • Component/s: datanode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      A DataNode typically sends a heartbeat to the NameNode every 3 seconds. We expect heartbeat handling to complete relatively quickly. However, if something unexpected causes heartbeat processing to get blocked, such as a long GC or heavy lock contention within the NameNode, then heartbeat processing would be delayed. After recovering from this delay, the DataNode then starts sending a storm of heartbeat messages in a tight loop. In a large cluster with many DataNodes, this storm of heartbeat messages could cause harmful load on the NameNode and make overall cluster recovery more difficult.

      The bug appears to be caused by incorrect timekeeping inside BPServiceActor. The next heartbeat time is always calculated as a delta from the previous heartbeat time, without any compensation for possible long latency on an individual heartbeat RPC. The only mitigations would be to restart all DataNodes to force a reset of the heartbeat schedule, or simply to wait out the storm until the scheduling catches up and corrects itself.
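
      The scheduling problem described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the BPServiceActor code or the committed patch: the class name HeartbeatScheduleSketch, its method, and the 60-second stall are hypothetical, and follow-up heartbeats are assumed to return instantly. It counts how many heartbeats are already overdue, and would therefore be sent in a tight loop, once a single stalled heartbeat RPC returns.

```java
/**
 * Hypothetical illustration of two ways to schedule the next heartbeat.
 * Not Hadoop code: all names here are invented for this sketch.
 */
public class HeartbeatScheduleSketch {
    /** Heartbeat interval from the description: 3 seconds. */
    static final long INTERVAL_MS = 3_000L;

    /**
     * Simulates one heartbeat RPC that blocks for stallMs and returns the
     * number of heartbeats that are overdue (sent back-to-back) afterwards.
     *
     * @param stallMs    how long the first heartbeat RPC is blocked
     * @param compensate false: next = previous scheduled time + interval;
     *                   true:  next = current time + interval
     */
    static int immediateHeartbeatsAfterStall(long stallMs, boolean compensate) {
        long now = 0L;   // simulated clock in milliseconds
        long next = 0L;  // first heartbeat is due at t = 0

        // The heartbeat sent at t = 0 does not return until stallMs later.
        now += stallMs;
        next = compensate ? now + INTERVAL_MS : next + INTERVAL_MS;

        int burst = 0;
        // Every heartbeat whose due time has already passed goes out at once.
        // Assumption: these follow-up RPCs return instantly, so now is fixed.
        while (next <= now) {
            burst++;
            next = compensate ? now + INTERVAL_MS : next + INTERVAL_MS;
        }
        return burst;
    }

    public static void main(String[] args) {
        long stallMs = 60_000L; // hypothetical 60-second NameNode stall
        System.out.println("delta from previous schedule: "
            + immediateHeartbeatsAfterStall(stallMs, false));
        System.out.println("delta from current time: "
            + immediateHeartbeatsAfterStall(stallMs, true));
    }
}
```

      Under these assumptions, scheduling from the previous heartbeat time leaves one overdue heartbeat for every interval that elapsed during the stall, while scheduling from the current time leaves none. Multiplied across every DataNode in a large cluster, the first policy is what produces the storm.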

      This problem would not manifest after a NameNode restart. In that case, the NameNode would respond to the first heartbeat by telling the DataNode to re-register, and BPServiceActor#reRegister would reset the heartbeat schedule to the current time. I believe the problem would only manifest if the NameNode process stayed alive but processed heartbeats unexpectedly slowly.

      1. HDFS-9305.01.patch
        3 kB
        Arpit Agarwal
      2. HDFS-9305.02.patch
        3 kB
        Arpit Agarwal

        Activity

        hadoopqa Hadoop QA added a comment -



        -1 overall



        Vote  Subsystem  Runtime  Comment
        -1    patch      0m 0s    The patch command could not apply the patch during dryrun.



        Subsystem       Report/Notes
        Patch URL       http://issues.apache.org/jira/secure/attachment/12768761/HDFS-9305.01.patch
        Optional Tests  javadoc javac unit findbugs checkstyle
        git revision    trunk / 123b3db
        Console output  https://builds.apache.org/job/PreCommit-HDFS-Build/13194/console

        This message was automatically generated.

        andrew.wang Andrew Wang added a comment -

        LGTM, though the test has some whitespace errors. +1 pending, feel free to fix at commit time via "git apply --whitespace=fix".

        hadoopqa Hadoop QA added a comment -



        -1 overall



        Vote  Subsystem        Runtime   Comment
        -1    pre-patch        20m 35s   Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings.
        +1    @author          0m 0s     The patch does not contain any @author tags.
        +1    tests included   0m 0s     The patch appears to include 1 new or modified test files.
        +1    javac            8m 52s    There were no new javac warning messages.
        +1    javadoc          11m 45s   There were no new javadoc warning messages.
        +1    release audit    0m 35s    The applied patch does not increase the total number of release audit warnings.
        +1    checkstyle       1m 39s    There were no new checkstyle issues.
        -1    whitespace       0m 0s     The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix.
        +1    install          2m 9s     mvn install still works.
        +1    eclipse:eclipse  0m 41s    The patch built with eclipse:eclipse.
        +1    findbugs         2m 50s    The patch does not introduce any new Findbugs (version 3.0.0) warnings.
        +1    native           3m 57s    Pre-build of native portion
        -1    hdfs tests       63m 50s   Tests failed in hadoop-hdfs.
                               116m 57s



        Reason             Tests
        Failed unit tests  hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
                           hadoop.hdfs.server.blockmanagement.TestNodeCount
                           hadoop.hdfs.TestWriteReadStripedFile
                           hadoop.hdfs.server.namenode.TestFileTruncate



        Subsystem                    Report/Notes
        Patch URL                    http://issues.apache.org/jira/secure/attachment/12768795/HDFS-9305.02.patch
        Optional Tests               javadoc javac unit findbugs checkstyle
        git revision                 trunk / 3cc7377
        Pre-patch Findbugs warnings  https://builds.apache.org/job/PreCommit-HDFS-Build/13197/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html
        whitespace                   https://builds.apache.org/job/PreCommit-HDFS-Build/13197/artifact/patchprocess/whitespace.txt
        hadoop-hdfs test log         https://builds.apache.org/job/PreCommit-HDFS-Build/13197/artifact/patchprocess/testrun_hadoop-hdfs.txt
        Test Results                 https://builds.apache.org/job/PreCommit-HDFS-Build/13197/testReport/
        Java                         1.7.0_55
        uname                        Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Console output               https://builds.apache.org/job/PreCommit-HDFS-Build/13197/console

        This message was automatically generated.

        arpitagarwal Arpit Agarwal added a comment -

        Thank you for the review, Andrew Wang, and thanks for reporting this, Chris Nauroth. I've committed this to trunk, branch-2 and branch-2.7.

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #8711 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8711/)
        HDFS-9305. Delayed heartbeat processing causes storm of subsequent (arp: rev d8736eb9ca351b82854601ea3b1fbc3c9fab44e4)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBpServiceActorScheduler.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #588 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/588/)
        HDFS-9305. Delayed heartbeat processing causes storm of subsequent (arp: rev d8736eb9ca351b82854601ea3b1fbc3c9fab44e4)

        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBpServiceActorScheduler.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #600 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/600/)
        HDFS-9305. Delayed heartbeat processing causes storm of subsequent (arp: rev d8736eb9ca351b82854601ea3b1fbc3c9fab44e4)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBpServiceActorScheduler.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #1324 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/1324/)
        HDFS-9305. Delayed heartbeat processing causes storm of subsequent (arp: rev d8736eb9ca351b82854601ea3b1fbc3c9fab44e4)

        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBpServiceActorScheduler.java
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk #2478 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2478/)
        HDFS-9305. Delayed heartbeat processing causes storm of subsequent (arp: rev d8736eb9ca351b82854601ea3b1fbc3c9fab44e4)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBpServiceActorScheduler.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #2531 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2531/)
        HDFS-9305. Delayed heartbeat processing causes storm of subsequent (arp: rev d8736eb9ca351b82854601ea3b1fbc3c9fab44e4)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBpServiceActorScheduler.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #541 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/541/)
        HDFS-9305. Delayed heartbeat processing causes storm of subsequent (arp: rev d8736eb9ca351b82854601ea3b1fbc3c9fab44e4)

        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBpServiceActorScheduler.java
        atm Aaron T. Myers added a comment -

        Setting the fix version to 2.7.2. Arpit Agarwal - if that's not right, please change it appropriately.

        arpitagarwal Arpit Agarwal added a comment -

        That is right, thanks for updating it.


          People

          • Assignee:
            arpitagarwal Arpit Agarwal
            Reporter:
            cnauroth Chris Nauroth
          • Votes:
            0
            Watchers:
            18
