Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-2240

Possible deadlock between LeaseRenewer and its factory

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: hdfs-client
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Lock cycle detected by jcarder

      1. h2240_20110811b.patch
        2 kB
        Tsz Wo Nicholas Sze
      2. h2240_20110811.patch
        2 kB
        Tsz Wo Nicholas Sze
      3. hdfs-2240-2.txt
        2 kB
        Todd Lipcon
      4. h2240_20110808.patch
        0.8 kB
        Tsz Wo Nicholas Sze
      5. lease-renewer-cycle.png
        47 kB
        Todd Lipcon

        Activity

        Hide
        Tsz Wo Nicholas Sze added a comment -

        Todd, thanks for running jcarder. The detected deadlock is a false alert since LeaseRenewer.Factory.get(..) is invoked indirectly by a DFSClient constructor only. DFSClient.close() won't be called inside the constructor.

        Show
        Tsz Wo Nicholas Sze added a comment - Todd, thanks for running jcarder. The detected deadlock is a false alert since LeaseRenewer.Factory.get(..) is invoked indirectly by a DFSClient constructor only. DFSClient.close() won't be called inside the constructor.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        I believe you mean LeaseRenewer instead of LeaseManager.

        Show
        Tsz Wo Nicholas Sze added a comment - I believe you mean LeaseRenewer instead of LeaseManager.
        Hide
        Todd Lipcon added a comment -

        Couldn't one DFSClient be constructing while another closes? I think it's best to have a policy of 0 lock inversions where possible.

        Show
        Todd Lipcon added a comment - Couldn't one DFSClient be constructing while another closes? I think it's best to have a policy of 0 lock inversions where possible.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        Todd, you are right. I will think about this.

        Show
        Tsz Wo Nicholas Sze added a comment - Todd, you are right. I will think about this.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        h2240_20110808.patch: release the factory lock before adding client.

        It seems that it is not easy to add a test. Let me know if you have a good idea to do so.

        Show
        Tsz Wo Nicholas Sze added a comment - h2240_20110808.patch: release the factory lock before adding client. It seems that it is not easy to add a test. Let me know if you have a good idea to do so.
        Hide
        Tsz Wo Nicholas Sze added a comment -
             [exec] -1 overall.  
             [exec] 
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec] 
             [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
             [exec]                         Please justify why no new tests are needed for this patch.
             [exec]                         Also please list what manual steps were performed to verify this patch.
             [exec] 
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec] 
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec] 
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
             [exec] 
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
             [exec] 
             [exec]     +1 system test framework.  The patch passed system test framework compile.
        
        Show
        Tsz Wo Nicholas Sze added a comment - [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] -1 tests included. The patch doesn't appear to include any new or modified tests. [exec] Please justify why no new tests are needed for this patch. [exec] Also please list what manual steps were performed to verify this patch. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] [exec] +1 system test framework. The patch passed system test framework compile.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        Ran unit tests. The following failed.

        • TestHDFSCLI expected hostname (letters and dots) but my machine used 127.0.0.1. E.g.
          2011-08-08 23:42:13,024 INFO  cli.CLITestHelper (CLITestHelper.java:displayResults(182)) -             Expected output:   [^moveFromLocal: `hdfs://\w+[.a-z]*:[0-9]+/wrongdir': No such file or directory]
          2011-08-08 23:42:13,025 INFO  cli.CLITestHelper (CLITestHelper.java:displayResults(184)) -               Actual output:   [moveFromLocal: `hdfs://127.0.0.1:50399/wrongdir': No such file or directory]
          
        • TestHDFSServerPorts failed with
          Directory /test/dfs/namesecondary is in an inconsistent state: checkpoint directory does not exist or is not accessible.
          org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /test/dfs/namesecondary is in an inconsistent state: checkpoint directory does not exist or is not accessible.
          	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.recoverCreate(SecondaryNameNode.java:801)
          
        • TestReplicationPolicy, TestCheckpoint, TestNNThroughputBenchmark, TestValidateConfigurationSettings failed with
          Cannot create directory /test/dfs/name/current
          java.io.IOException: Cannot create directory /test/dfs/name/current
          	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:276)
          	at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:492)
          	...
          

        The failures are not related to this.

        Show
        Tsz Wo Nicholas Sze added a comment - Ran unit tests. The following failed. TestHDFSCLI expected hostname (letters and dots) but my machine used 127.0.0.1. E.g. 2011-08-08 23:42:13,024 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(182)) - Expected output: [^moveFromLocal: `hdfs://\w+[.a-z]*:[0-9]+/wrongdir': No such file or directory] 2011-08-08 23:42:13,025 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(184)) - Actual output: [moveFromLocal: `hdfs://127.0.0.1:50399/wrongdir': No such file or directory] TestHDFSServerPorts failed with Directory /test/dfs/namesecondary is in an inconsistent state: checkpoint directory does not exist or is not accessible. org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /test/dfs/namesecondary is in an inconsistent state: checkpoint directory does not exist or is not accessible. at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.recoverCreate(SecondaryNameNode.java:801) TestReplicationPolicy, TestCheckpoint, TestNNThroughputBenchmark, TestValidateConfigurationSettings failed with Cannot create directory /test/dfs/name/current java.io.IOException: Cannot create directory /test/dfs/name/current at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:276) at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:492) ... The failures are not related to this.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        I forgot to say that those tests also fail without applying the patch. I am not sure why they started failing earlier today.

        Show
        Tsz Wo Nicholas Sze added a comment - I forgot to say that those tests also fail without applying the patch. I am not sure why they started failing earlier today.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        Okay, the test failures are due to HDFS-2230. They passed (except TestHDFSCLI) after I reverted HDFS-2230 locally.

        Show
        Tsz Wo Nicholas Sze added a comment - Okay, the test failures are due to HDFS-2230 . They passed (except TestHDFSCLI) after I reverted HDFS-2230 locally.
        Hide
        Todd Lipcon added a comment -

        Even with the first patch, this didn't eliminate the cycles. Attaching a patch which does. I also removed the now-unused "dfsc" argument from LeaseRenewer.get.

        I agree this deadlock would not happen in practice, but having clean jcarder runs makes it easier to detect when we do introduce real deadlocks.

        Show
        Todd Lipcon added a comment - Even with the first patch, this didn't eliminate the cycles. Attaching a patch which does. I also removed the now-unused "dfsc" argument from LeaseRenewer.get. I agree this deadlock would not happen in practice, but having clean jcarder runs makes it easier to detect when we do introduce real deadlocks.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        Todd, thanks for checking it. You are right that it is not a bug since the synchronization is in a constructor. I agree that it is good to fix the warning though. How about using unsync-call in constructor? It requires very little code change.

        h2240_20110811.patch: unsyncSetGraceSleepPeriod(..)

        Show
        Tsz Wo Nicholas Sze added a comment - Todd, thanks for checking it. You are right that it is not a bug since the synchronization is in a constructor. I agree that it is good to fix the warning though. How about using unsync-call in constructor? It requires very little code change. h2240_20110811.patch: unsyncSetGraceSleepPeriod(..)
        Hide
        Todd Lipcon added a comment -

        sure, that makes sense. But can you remove the now-unused dfsc argument to get(...)?

        Show
        Todd Lipcon added a comment - sure, that makes sense. But can you remove the now-unused dfsc argument to get(...) ?
        Hide
        Tsz Wo Nicholas Sze added a comment -

        h2240_20110811b.patch: removed the unused dfsc.

        Show
        Tsz Wo Nicholas Sze added a comment - h2240_20110811b.patch: removed the unused dfsc.
        Hide
        Todd Lipcon added a comment -

        +1

        Show
        Todd Lipcon added a comment - +1
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12490220/h2240_20110811b.patch
        against trunk revision 1156967.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.server.blockmanagement.TestReplicationPolicy
        org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery
        org.apache.hadoop.hdfs.server.datanode.TestDataDirs
        org.apache.hadoop.hdfs.server.namenode.TestCheckpoint
        org.apache.hadoop.hdfs.server.namenode.TestGetImageServlet
        org.apache.hadoop.hdfs.server.namenode.TestINodeFile
        org.apache.hadoop.hdfs.server.namenode.TestNNLeaseRecovery
        org.apache.hadoop.hdfs.server.namenode.TestNNThroughputBenchmark
        org.apache.hadoop.hdfs.server.namenode.TestValidateConfigurationSettings
        org.apache.hadoop.hdfs.TestHDFSServerPorts

        +1 contrib tests. The patch passed contrib unit tests.

        -1 system test framework. The patch failed system test framework compile.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1098//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/1098//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1098//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12490220/h2240_20110811b.patch against trunk revision 1156967. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.server.blockmanagement.TestReplicationPolicy org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery org.apache.hadoop.hdfs.server.datanode.TestDataDirs org.apache.hadoop.hdfs.server.namenode.TestCheckpoint org.apache.hadoop.hdfs.server.namenode.TestGetImageServlet org.apache.hadoop.hdfs.server.namenode.TestINodeFile org.apache.hadoop.hdfs.server.namenode.TestNNLeaseRecovery org.apache.hadoop.hdfs.server.namenode.TestNNThroughputBenchmark org.apache.hadoop.hdfs.server.namenode.TestValidateConfigurationSettings org.apache.hadoop.hdfs.TestHDFSServerPorts +1 contrib tests. The patch passed contrib unit tests. -1 system test framework. The patch failed system test framework compile. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1098//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/1098//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1098//console This message is automatically generated.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        The failed tests are not related. All of them were failing before the patch.

        Show
        Tsz Wo Nicholas Sze added a comment - The failed tests are not related. All of them were failing before the patch.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        I have committed this.

        Show
        Tsz Wo Nicholas Sze added a comment - I have committed this.
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #832 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/832/)
        HDFS-2240. Fix a deadlock in LeaseRenewer by enforcing lock acquisition ordering.

        szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156977
        Files :

        • /hadoop/common/trunk/hdfs/CHANGES.txt
        • /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/LeaseRenewer.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #832 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/832/ ) HDFS-2240 . Fix a deadlock in LeaseRenewer by enforcing lock acquisition ordering. szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156977 Files : /hadoop/common/trunk/hdfs/CHANGES.txt /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/LeaseRenewer.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #740 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/740/)
        HDFS-2240. Fix a deadlock in LeaseRenewer by enforcing lock acquisition ordering.

        szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156977
        Files :

        • /hadoop/common/trunk/hdfs/CHANGES.txt
        • /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/LeaseRenewer.java
        Show
        Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #740 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/740/ ) HDFS-2240 . Fix a deadlock in LeaseRenewer by enforcing lock acquisition ordering. szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156977 Files : /hadoop/common/trunk/hdfs/CHANGES.txt /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/LeaseRenewer.java

          People

          • Assignee:
            Tsz Wo Nicholas Sze
            Reporter:
            Todd Lipcon
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development