HDFS-5299

DFS client hangs in updatePipeline RPC when failover happened

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.1.0-beta, 3.0.0-alpha1
    • Fix Version/s: 2.2.0
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The DFSClient hung in an updatePipeline call to the NameNode when a failover happened at exactly the same time.

      When we dug into it, the issue turned out to be in how updatePipeline handles the RetryCache.

      Here are the steps:
      1. The client was writing slowly.
      2. One of the DataNodes went down, and updatePipeline was called on the ANN.
      3. The call reached the ANN, which shut down while it was processing the updatePipeline call.
      4. The client retried (since the API is marked AtMostOnce) against the other NameNode. At that time this NN was still in STANDBY, so the call got a StandbyException.
      5. The client failed over one more time.
      6. The SNN then became Active.
      7. The client called updatePipeline again on the current ANN.

      The client call now hung in the NN, waiting for the cached call with the same call ID to complete. But that cached call had already finished the previous time with a StandbyException.

      Conclusion:
      Whenever a new entry is added to the cache, we need to update the result of the call before returning from the call or throwing an exception.
      I can see a similar issue in multiple RPCs in FSNamesystem.
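
      The conclusion above can be illustrated with a minimal sketch. It uses a hypothetical, single-threaded model of a retry cache, not Hadoop's RetryCache: the class RetryCacheSketch and the methods begin, complete, retryWouldWait, handleBuggy, and handleFixed are all assumptions made for illustration. Instead of actually blocking, the model only reports whether a retry would have to wait on an entry that is still marked in progress.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a server-side retry cache.
// It is NOT Hadoop's RetryCache; all names here are assumptions for illustration.
class RetryCacheSketch {
  enum State { IN_PROGRESS, SUCCESS, FAILED }

  private final Map<Integer, State> entries = new HashMap<>();

  // Registers a new call, or reports the state left behind by an earlier attempt.
  // Returns null for a call ID that has not been seen before.
  State begin(int callId) {
    State previous = entries.get(callId);
    if (previous == null) {
      entries.put(callId, State.IN_PROGRESS);
    }
    return previous;
  }

  // Records the final outcome so a retry never finds a stale IN_PROGRESS entry.
  void complete(int callId, boolean success) {
    entries.put(callId, success ? State.SUCCESS : State.FAILED);
  }

  // In a blocking cache, a retry waits while the earlier attempt is IN_PROGRESS.
  boolean retryWouldWait(int callId) {
    return entries.get(callId) == State.IN_PROGRESS;
  }

  // Buggy shape: the exception escapes and the entry stays IN_PROGRESS.
  static void handleBuggy(RetryCacheSketch cache, int callId, boolean standby) {
    if (cache.begin(callId) == State.SUCCESS) {
      return; // an earlier attempt already did the work
    }
    if (standby) {
      throw new IllegalStateException("standby");
    }
    cache.complete(callId, true);
  }

  // Fixed shape: the finally block records the outcome on every exit path.
  static void handleFixed(RetryCacheSketch cache, int callId, boolean standby) {
    if (cache.begin(callId) == State.SUCCESS) {
      return; // an earlier attempt already did the work
    }
    boolean success = false;
    try {
      if (standby) {
        throw new IllegalStateException("standby");
      }
      success = true; // the real work would happen before this line
    } finally {
      cache.complete(callId, success);
    }
  }

  public static void main(String[] args) {
    RetryCacheSketch buggy = new RetryCacheSketch();
    try { handleBuggy(buggy, 1, true); } catch (IllegalStateException expected) { }
    System.out.println("buggy: retry would wait = " + buggy.retryWouldWait(1));

    RetryCacheSketch fixed = new RetryCacheSketch();
    try { handleFixed(fixed, 1, true); } catch (IllegalStateException expected) { }
    System.out.println("fixed: retry would wait = " + fixed.retryWouldWait(1));
  }
}
```

      In this model the buggy handler leaves the entry in progress after the exception, which corresponds to the retried call waiting indefinitely, while the fixed handler marks the entry as failed so that a retry can proceed.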

      1. HDFS-5299.000.patch
        11 kB
        Jing Zhao
      2. HDFS-5299.patch
        8 kB
        Vinayakumar B

        Activity

        vinayrpet Vinayakumar B added a comment -

        Attaching the patch. Please review

        hadoopqa Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12606736/HDFS-5299.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5094//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5094//console

        This message is automatically generated.

        umamaheswararao Uma Maheswara Rao G added a comment -

        Nit:

        cluster = new MiniDFSCluster.Builder(new Configuration())
            .nnTopology(MiniDFSNNTopology.simpleHATopology()).numDataNodes(1)
            .build();
        

        Please reuse the existing conf object; you need not create a new one for it.

        Also, please add a short javadoc for the test.

        CacheEntryWithPayload cacheEntry = RetryCache.waitForCompletion(retryCache,
            null);
        if (cacheEntry != null && cacheEntry.isSuccess()) {
          return (String) cacheEntry.getPayload();
        }
        final FSPermissionChecker pc = getPermissionChecker();
        

        Here, if getPermissionChecker throws an exception, can a similar situation occur for that call? I think we will not retry on this exception, but waiting on the retry cache and setting its state should happen in the proper order to avoid situations like this.
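
        The ordering concern can be sketched as follows. This is a hypothetical model, not the FSNamesystem code: CheckOrderSketch, checkAllowed, cacheThenCheck, checkThenCache, and hasDanglingEntry are assumed names, and a plain set of call IDs stands in for retry cache entries that are still in progress.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model: tracks call IDs whose cache entry is still marked in progress.
// All names are assumptions for illustration, not Hadoop APIs.
class CheckOrderSketch {
  private final Set<Integer> inProgress = new HashSet<>();

  // Stand-in for a precondition such as a state or permission check.
  private static void checkAllowed(boolean allowed) {
    if (!allowed) {
      throw new SecurityException("operation not allowed");
    }
  }

  // Risky order: the entry is created first, then the check throws
  // before the try/finally that would have cleaned it up.
  void cacheThenCheck(int callId, boolean allowed) {
    inProgress.add(callId);
    checkAllowed(allowed); // may throw outside the try block
    try {
      // the real work would happen here
    } finally {
      inProgress.remove(callId);
    }
  }

  // Safer order: checks that can throw run before any cache entry exists.
  void checkThenCache(int callId, boolean allowed) {
    checkAllowed(allowed);
    inProgress.add(callId);
    try {
      // the real work would happen here
    } finally {
      inProgress.remove(callId);
    }
  }

  // True if an entry for this call ID was left behind in progress.
  boolean hasDanglingEntry(int callId) {
    return inProgress.contains(callId);
  }

  public static void main(String[] args) {
    CheckOrderSketch risky = new CheckOrderSketch();
    try { risky.cacheThenCheck(1, false); } catch (SecurityException expected) { }
    System.out.println("cache then check leaves entry: " + risky.hasDanglingEntry(1));

    CheckOrderSketch safer = new CheckOrderSketch();
    try { safer.checkThenCache(1, false); } catch (SecurityException expected) { }
    System.out.println("check then cache leaves entry: " + safer.hasDanglingEntry(1));
  }
}
```

        In this model, running the check first means a failed check never creates an entry, so there is nothing for a later retry to wait on.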

        jingzhao Jing Zhao added a comment -

        Thanks for the fix, Vinay! The patch looks great. Some minor comments:

        1. The patch requires a rebase now that HDFS-5300 has been committed.
        2. The same comment as Uma's about "getPermissionChecker".
        3. It looks like after the rebase there will be an extra "checkOperation(OperationCategory.WRITE)" in FSNamesystem#deleteSnapshot.
        4. In FSNamesystem#saveNamespace, the checkSuperuserPrivilege call also needs to be moved before the retry cache operation.
        jingzhao Jing Zhao added a comment -

        Since only some trivial changes are required, I have generated a rebased patch that also addresses Uma's comments.

        "Please reuse the existing conf object; you need not create a new one for it."

        Here I think Vinay's patch is correct. The conf object will be modified by the MiniDFSCluster, and the HA conf setting would then affect the following tests. So in the new patch, I reuse the conf object but put its creation in the setup method.

        brandonli Brandon Li added a comment -

        +1. Patch looks good.

        hadoopqa Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12607029/HDFS-5299.000.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5110//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5110//console

        This message is automatically generated.

        umamaheswararao Uma Maheswara Rao G added a comment -

        Thanks a lot, Jing, for addressing the comments.
        +1 on the latest patch. Also, thanks for the review, Brandon.

        vinayrpet Vinayakumar B added a comment -

        Thanks, Jing, for the rebase and for addressing Uma's comment. I was away this weekend, so I could not address Uma's comment in time.
        Thanks also to Uma and Brandon for the reviews.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #4553 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4553/)
        HDFS-5299. DFS client hangs in updatePipeline RPC when failover happened. Contributed by Vinay. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1529660)

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNamenodeRetryCache.java
        jingzhao Jing Zhao added a comment -

        Thanks again, Vinay, Uma, and Brandon! I've committed this to trunk, branch-2 and branch-2.1-beta.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #355 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/355/)
        HDFS-5299. DFS client hangs in updatePipeline RPC when failover happened. Contributed by Vinay. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1529660)

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNamenodeRetryCache.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Hdfs-trunk #1545 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1545/)
        HDFS-5299. DFS client hangs in updatePipeline RPC when failover happened. Contributed by Vinay. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1529660)

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNamenodeRetryCache.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #1571 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1571/)
        HDFS-5299. DFS client hangs in updatePipeline RPC when failover happened. Contributed by Vinay. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1529660)

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNamenodeRetryCache.java
        vinodkv Vinod Kumar Vavilapalli added a comment -

        Closing tickets that are already part of a release.


          People

          • Assignee: vinayrpet Vinayakumar B
          • Reporter: vinayrpet Vinayakumar B
          • Votes: 0
          • Watchers: 12
