Hadoop HDFS / HDFS-10688

BPServiceActor may run into a tight loop for sending block report when hitting IOException

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: datanode
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      Currently, in BPServiceActor#offerService, when the DataNode hits a local IOException, it only logs the exception and immediately re-enters the while loop. Unlike the RemoteException branch, the IOException branch does not sleep:

            } catch(RemoteException re) {
              .......
              LOG.warn("RemoteException in offerService", re);
              try {
                long sleepTime = Math.min(1000, dnConf.heartBeatInterval);
                Thread.sleep(sleepTime);
              } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
              }
            } catch (IOException e) {
              LOG.warn("IOException in offerService", e);
            }
      

      This tight loop can cause problems. For example, in a production cluster we saw a DataNode hit an exception during a Kerberos realm lookup. The tight loop eventually caused the DataNode to send hundreds of DNS lookup queries per second.
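
      One way to break the tight loop is to sleep briefly in the IOException branch as well, mirroring the RemoteException branch. The following is a minimal, self-contained sketch of that idea, not the committed patch. The class OfferServiceBackoffSketch, the Work interface, the runLoop method, and the heartBeatIntervalMs field (a stand-in for dnConf.heartBeatInterval) are hypothetical names used only for illustration; sleepAfterException is the helper name suggested in the review comments.

```java
import java.io.IOException;

// Illustrative sketch only: none of this is the actual BPServiceActor code.
final class OfferServiceBackoffSketch {

  // Hypothetical stand-in for dnConf.heartBeatInterval, in milliseconds.
  static long heartBeatIntervalMs = 3000;

  // Hypothetical unit of work that may fail with a local IOException.
  interface Work {
    void run() throws IOException;
  }

  // Sleep for min(1000 ms, heartbeat interval), the same bound used in the
  // RemoteException branch. Returns the requested sleep time in milliseconds.
  static long sleepAfterException() {
    long sleepTime = Math.min(1000, heartBeatIntervalMs);
    try {
      Thread.sleep(sleepTime);
    } catch (InterruptedException ie) {
      // Preserve the interrupt status so the caller can observe it.
      Thread.currentThread().interrupt();
    }
    return sleepTime;
  }

  // Simplified loop: on IOException, log and back off instead of
  // retrying immediately. Returns the number of failed iterations.
  static int runLoop(Work work, int maxIterations) {
    int failures = 0;
    for (int i = 0; i < maxIterations; i++) {
      try {
        work.run();
      } catch (IOException e) {
        failures++;
        System.err.println("IOException in offerService: " + e.getMessage());
        sleepAfterException();
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    heartBeatIntervalMs = 50; // keep the demo short
    int failures = runLoop(() -> {
      throw new IOException("simulated lookup failure");
    }, 3);
    System.out.println("failures=" + failures);
  }
}
```

      With the sleep in place, a persistently failing call is retried at most about once per second (or once per heartbeat interval, if that is shorter) rather than as fast as the loop can spin.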

      1. HDFS-10688.001.patch
        1 kB
        Chen Liang
      2. HDFS-10688.002.patch
        1 kB
        Chen Liang

        Activity

        jingzhao Jing Zhao added a comment - - edited

        We can follow the same sleep logic as RemoteException when handling IOException.

        arpitagarwal Arpit Agarwal added a comment -

        Nice find Jing Zhao!

        vagarychen Chen Liang added a comment -

        Added a simple patch to fix this by sleeping for a while before re-entering the loop.

        jingzhao Jing Zhao added a comment -

        Thanks for the fix, Chen Liang! Maybe we can rename the new sleep method to a more specific name like sleepAfterException? Other than this +1.

        vagarychen Chen Liang added a comment -

        Thanks for the review and suggestion! Just submitted an updated patch.

        hadoopqa Hadoop QA added a comment -
        -1 overall

        Vote  Subsystem   Runtime  Comment
        0     reexec      12m 29s  Docker mode activated.
        +1    @author     0m 0s    The patch does not contain any @author tags.
        -1    test4tests  0m 0s    The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1    mvninstall  6m 44s   trunk passed
        +1    compile     0m 46s   trunk passed
        +1    checkstyle  0m 26s   trunk passed
        +1    mvnsite     0m 51s   trunk passed
        +1    mvneclipse  0m 12s   trunk passed
        +1    findbugs    1m 40s   trunk passed
        +1    javadoc     0m 56s   trunk passed
        +1    mvninstall  0m 46s   the patch passed
        +1    compile     0m 41s   the patch passed
        +1    javac       0m 41s   the patch passed
        +1    checkstyle  0m 23s   the patch passed
        +1    mvnsite     0m 47s   the patch passed
        +1    mvneclipse  0m 10s   the patch passed
        +1    whitespace  0m 0s    The patch has no whitespace issues.
        +1    findbugs    1m 44s   the patch passed
        +1    javadoc     0m 52s   the patch passed
        +1    unit        57m 7s   hadoop-hdfs in the patch passed.
        +1    asflicense  0m 18s   The patch does not generate ASF License warnings.
                          88m 3s

        Subsystem       Report/Notes
        Docker          Image:yetus/hadoop:9560f25
        JIRA Patch URL  https://issues.apache.org/jira/secure/attachment/12820023/HDFS-10688.002.patch
        JIRA Issue      HDFS-10688
        Optional Tests  asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname           Linux 9639cb67488c 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool      maven
        Personality     /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision    trunk / 703fdf8
        Default Java    1.8.0_101
        findbugs        v3.0.0
        Test Results    https://builds.apache.org/job/PreCommit-HDFS-Build/16172/testReport/
        modules         C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output  https://builds.apache.org/job/PreCommit-HDFS-Build/16172/console
        Powered by      Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        jingzhao Jing Zhao added a comment -

        I've committed the patch to trunk, branch-2 and branch-2.8. Thanks Chen Liang for the contribution!

        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #10147 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10147/)
        HDFS-10688. BPServiceActor may run into a tight loop for sending block (jing9: rev 0cde9e12a7175e4d8bc4ccd5c36055b280d1fbd6)

        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java

          People

          • Assignee: vagarychen Chen Liang
          • Reporter: jingzhao Jing Zhao
          • Votes: 0
          • Watchers: 11
