Hadoop HDFS / HDFS-4315

DNs with multiple BPs can have BPOfferServices fail to start due to unsynchronized map access

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2-alpha
    • Fix Version/s: 2.0.3-alpha, 0.23.6
    • Component/s: datanode
    • Labels:
      None

      Description

      In some nightly test runs we've seen pretty frequent failures of TestWebHdfsWithMultipleNameNodes. I've traced the root cause to an unsynchronized map access in the DataStorage class.

      More details in the first comment.

      1. HDFS-4315.patch (2 kB, Aaron T. Myers)

        Activity

        Aaron T. Myers added a comment -

        In all of the failing test runs that I saw, the client would end up failing with an error like the following:

        2012-12-14 16:30:36,818 WARN  hdfs.DFSClient (DFSOutputStream.java:run(562)) - DataStreamer Exception
        java.io.IOException: Failed to add a datanode.  User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration, where the current policy is DEFAULT.  (Nodes: current=[127.0.0.1:52552, 127.0.0.1:43557], original=[127.0.0.1:43557, 127.0.0.1:52552])
          at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:792)
          at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:852)
          at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:958) 
          at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:469)
        

        This suggests that either an entire DN or one of the BPOfferServices of one of the DNs was not starting correctly, or had not started by the time the client was trying to access it. Unfortunately, TestWebHdfsWithMultipleNameNodes disables the DN logger, so it wasn't obvious what was causing that problem. Upon changing the test to not disable the logger and looping the test, I would occasionally see an error like the following:

        java.lang.NullPointerException
          at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:850)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:819)
          at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:308)
          at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:218)
          at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:660)
          at java.lang.Thread.run(Thread.java:662)
        

        This error would cause one of the BPOfferServices in one of the DNs to not come up. The reason is that concurrent, unsynchronized puts to the HashMap DataStorage#bpStorageMap result in undefined behavior, including previously inserted entries no longer appearing to be in the map.
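
For illustration only, here is a minimal sketch of one common remedy for this kind of race: wrapping the HashMap with Collections.synchronizedMap so that concurrent puts are serialized. This is not taken from the attached patch, whose contents are not shown in this thread. The class BpStorageMapSketch, the BlockPoolStorage stand-in type, and the "BP-..." key format are all hypothetical names invented for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

public class BpStorageMapSketch {
    // Hypothetical stand-in for the per-block-pool storage object.
    static final class BlockPoolStorage {
        final String bpid;
        BlockPoolStorage(String bpid) { this.bpid = bpid; }
    }

    /**
     * Starts `threads` threads that each put `perThread` distinct keys into
     * one shared synchronized map, then returns the final map size.
     */
    static int registerConcurrently(int threads, int perThread) {
        // Assumption: a synchronized wrapper is used; with a bare HashMap
        // the same workload may lose entries.
        final Map<String, BlockPoolStorage> map =
            Collections.synchronizedMap(new HashMap<String, BlockPoolStorage>());
        final CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                try {
                    start.await(); // release all writers at the same time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++) {
                    String bpid = "BP-" + id + "-" + i; // hypothetical key format
                    map.put(bpid, new BlockPoolStorage(bpid));
                }
            });
            workers[t].start();
        }
        start.countDown();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(registerConcurrently(8, 10000)); // prints 80000
    }
}
```

Because every put goes through the wrapper's lock, all 80000 distinct keys are retained. A ConcurrentHashMap would be another option; the wrapper is used here only because it is the smallest change from a plain HashMap.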

        Aaron T. Myers added a comment -

        Here's a patch which addresses the issue. I've been looping the test for an hour now with no failures, whereas before it used to fail pretty reliably within 10 minutes. I'll keep it looping over the weekend and see how it goes.

        This patch also takes the liberty of re-enabling the DN log in TestWebHdfsWithMultipleNameNodes, so that we can better see the root cause of later failures.

        Eli Collins added a comment -

        Nice find!

        +1 pending jenkins

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12561073/HDFS-4315.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3668//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3668//console

        This message is automatically generated.

        Aaron T. Myers added a comment -

        I looped this test 5923 times for about 50 hours and it never failed. I'm going to conclude based on this that this patch fixes the issue.

        I'm going to commit this momentarily.

        Aaron T. Myers added a comment -

        I've just committed this to trunk and branch-2.

        Thanks a lot for the review, Eli.

        Hudson added a comment -

        Integrated in Hadoop-trunk-Commit #3130 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3130/)
        HDFS-4315. DNs with multiple BPs can have BPOfferServices fail to start due to unsynchronized map access. Contributed by Aaron T. Myers. (Revision 1422778)

        Result = SUCCESS
        atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1422778
        Files :

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHdfsWithMultipleNameNodes.java
        Hudson added a comment -

        Integrated in Hadoop-Yarn-trunk #68 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/68/)
        HDFS-4315. DNs with multiple BPs can have BPOfferServices fail to start due to unsynchronized map access. Contributed by Aaron T. Myers. (Revision 1422778)

        Result = SUCCESS
        atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1422778
        Files :

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHdfsWithMultipleNameNodes.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #1257 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1257/)
        HDFS-4315. DNs with multiple BPs can have BPOfferServices fail to start due to unsynchronized map access. Contributed by Aaron T. Myers. (Revision 1422778)

        Result = FAILURE
        atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1422778
        Files :

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHdfsWithMultipleNameNodes.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #1288 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1288/)
        HDFS-4315. DNs with multiple BPs can have BPOfferServices fail to start due to unsynchronized map access. Contributed by Aaron T. Myers. (Revision 1422778)

        Result = SUCCESS
        atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1422778
        Files :

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHdfsWithMultipleNameNodes.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #467 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/467/)
        HDFS-4315. DNs with multiple BPs can have BPOfferServices fail to start due to unsynchronized map access. (atm via tgraves) (Revision 1422972)

        Result = UNSTABLE
        tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1422972
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
        • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHdfsWithMultipleNameNodes.java

          People

          • Assignee: Aaron T. Myers
          • Reporter: Aaron T. Myers
          • Votes: 0
          • Watchers: 7
