Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-6229

Race condition in failover can cause RetryCache fail to work

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0-beta
    • Fix Version/s: 2.4.1
    • Component/s: ha
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Currently when NN failover happens, the old SBN first sets its state to active, then starts the active services (including tailing all the remaining editlog and building a complete retry cache based on the editlog). If a retry request, which has already succeeded in the old ANN (but the client fails to receive the response), comes in between, this retry may still get served by the new ANN but miss the retry cache.

      1. retrycache-race.patch
        2 kB
        Jing Zhao
      2. HDFS-6229.000.patch
        5 kB
        Jing Zhao

        Activity

        Hide
        Jing Zhao added a comment -

        The patch can mimic the scenario and cause unit tests in TestRetryCacheWithHA timeout.

        Show
        Jing Zhao added a comment - The patch can mimic the scenario and cause unit tests in TestRetryCacheWithHA timeout.
        Hide
        Jing Zhao added a comment -

        We can let HAState#setStateInternal hold locks of both FSNamesystem and FSNamesystem#retryCache. In this way reading retryCache will not happen between the NN failover. Upload an initial patch to fix.

        Show
        Jing Zhao added a comment - We can let HAState#setStateInternal hold locks of both FSNamesystem and FSNamesystem#retryCache. In this way reading retryCache will not happen between the NN failover. Upload an initial patch to fix.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12639639/HDFS-6229.000.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

        org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6647//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6647//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12639639/HDFS-6229.000.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6647//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6647//console This message is automatically generated.
        Hide
        Jing Zhao added a comment -

        The failed unit test should be unrelated.

        Show
        Jing Zhao added a comment - The failed unit test should be unrelated.
        Hide
        Suresh Srinivas added a comment -

        +1 for the patch. Thanks Jing for working on this.

        Show
        Suresh Srinivas added a comment - +1 for the patch. Thanks Jing for working on this.
        Hide
        Jing Zhao added a comment -

        Thanks for the review, Suresh! I've committed this to trunk, branch-2, and branch-2.4.

        Show
        Jing Zhao added a comment - Thanks for the review, Suresh! I've committed this to trunk, branch-2, and branch-2.4.
        Hide
        Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #5503 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5503/)
        HDFS-6229. Race condition in failover can cause RetryCache fail to work. Contributed by Jing Zhao. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1586714)

        • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RetryCache.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
        Show
        Hudson added a comment - SUCCESS: Integrated in Hadoop-trunk-Commit #5503 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5503/ ) HDFS-6229 . Race condition in failover can cause RetryCache fail to work. Contributed by Jing Zhao. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1586714 ) /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RetryCache.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
        Hide
        Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #538 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/538/)
        HDFS-6229. Race condition in failover can cause RetryCache fail to work. Contributed by Jing Zhao. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1586714)

        • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RetryCache.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
        Show
        Hudson added a comment - SUCCESS: Integrated in Hadoop-Yarn-trunk #538 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/538/ ) HDFS-6229 . Race condition in failover can cause RetryCache fail to work. Contributed by Jing Zhao. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1586714 ) /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RetryCache.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
        Hide
        Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Hdfs-trunk #1730 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1730/)
        HDFS-6229. Race condition in failover can cause RetryCache fail to work. Contributed by Jing Zhao. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1586714)

        • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RetryCache.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
        Show
        Hudson added a comment - SUCCESS: Integrated in Hadoop-Hdfs-trunk #1730 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1730/ ) HDFS-6229 . Race condition in failover can cause RetryCache fail to work. Contributed by Jing Zhao. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1586714 ) /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RetryCache.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
        Hide
        Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1755 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1755/)
        HDFS-6229. Race condition in failover can cause RetryCache fail to work. Contributed by Jing Zhao. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1586714)

        • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RetryCache.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
        Show
        Hudson added a comment - SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1755 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1755/ ) HDFS-6229 . Race condition in failover can cause RetryCache fail to work. Contributed by Jing Zhao. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1586714 ) /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RetryCache.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java

          People

          • Assignee:
            Jing Zhao
            Reporter:
            Jing Zhao
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development