Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Either I am doing something incredibly stupid, or something about my environment is completely weird, or maybe it really is a valid bug. I am consistently running into a NameNode deadlock with 0.23 HDFS. I could never start the NN successfully.

      1. h2229_20110810.patch
        4 kB
        Tsz Wo Nicholas Sze
      2. h2229_20110807.patch
        1 kB
        Tsz Wo Nicholas Sze
      3. cycle2.png
        104 kB
        Todd Lipcon
      4. cycle1.png
        56 kB
        Todd Lipcon

        Issue Links

          Activity

          Vinod Kumar Vavilapalli added a comment -

          Thread dump:

          Found one Java-level deadlock:
          =============================
          "Thread[Thread-10,5,main]":
          waiting to lock monitor 0x08e19b8c (object 0x31b0b7f0, a org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
          which is held by "main"
          "main":
          waiting for ownable synchronizer 0x31641a50, (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
          which is held by "Thread[Thread-10,5,main]"

          Java stack information for the threads listed above:
          ===================================================
          "Thread[Thread-10,5,main]":
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.isOn(FSNamesystem.java:3183)

          • waiting to lock <0x31b0b7f0> (a org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.isInSafeMode(FSNamesystem.java:3563)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.logUpdateMasterKey(FSNamesystem.java:4523)
            at org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenSecretManager.logUpdateMasterKey(DelegationTokenSecretManager.java:279)
            at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.updateCurrentKey(AbstractDelegationTokenSecretManager.java:144)
            at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.rollMasterKey(AbstractDelegationTokenSecretManager.java:168)
            at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager$ExpiredTokenRemover.run(AbstractDelegationTokenSecretManager.java:373)
            at java.lang.Thread.run(Thread.java:619)
            "main":
            at sun.misc.Unsafe.park(Native Method)
          • parking to wait for <0x31641a50> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
            at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
            at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeLock(FSNamesystem.java:382)
            at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatedBlocks(BlockManager.java:1743)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.initializeReplQueues(FSNamesystem.java:3257)
          • locked <0x31b0b7f0> (a org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.leave(FSNamesystem.java:3228)
          • locked <0x31b0b7f0> (a org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.checkMode(FSNamesystem.java:3315)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.setBlockTotal(FSNamesystem.java:3342)
          • locked <0x31b0b7f0> (a org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setBlockTotal(FSNamesystem.java:3619)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.activate(FSNamesystem.java:322)
            at org.apache.hadoop.hdfs.server.namenode.NameNode.activate(NameNode.java:489)
            at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:452)
            at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:561)
            at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:553)
            at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1538)
            at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1578)

          Found 1 deadlock.

          Logs show that the RPC server is started but not the HTTP server:

          ...
          2011-08-05 08:54:35,286 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Number of files = 1
          2011-08-05 08:54:35,286 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Number of files under construction = 0
          2011-08-05 08:54:35,287 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 113 loaded in 0 seconds.
          2011-08-05 08:54:35,287 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded image for txid 0 from /tmp/hdfs/hadoop/var/hdfs/name/current/fsimage_0000000000000000000
          2011-08-05 08:54:35,288 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /tmp/hdfs/hadoop/var/hdfs/name/current/edits_0000000000000000001-0000000000000000002 of size 1048576 edits # 2 loaded in 0 seconds.
          2011-08-05 08:54:35,291 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 3
          2011-08-05 08:54:35,363 INFO org.apache.hadoop.hdfs.server.namenode.NameCache: initialized with 0 entries 0 lookups
          2011-08-05 08:54:35,363 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 387 msecs
          2011-08-05 08:54:35,428 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020
          2011-08-05 08:54:35,430 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #2 for port 8020
          2011-08-05 08:54:35,432 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #3 for port 8020
          2011-08-05 08:54:35,433 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #4 for port 8020
          2011-08-05 08:54:35,443 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Registered source RpcActivityForPort8020
          2011-08-05 08:54:35,451 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Registered source RpcDetailedActivityForPort8020
          2011-08-05 08:54:35,455 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
          2011-08-05 08:54:35,457 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
          2011-08-05 08:54:35,457 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
          2011-08-05 08:54:35,457 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of blocks under construction: 0
          2011-08-05 08:54:35,457 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: initializing replication queues
          ^Stuck here

          Suresh Srinivas added a comment -

          This is related to the block manager being refactored out of FSNamesystem. Nicholas, could you please take a look at this?

          The namenode main thread grabs the safemode lock, and then BlockManager#processMisReplicatedBlocks() takes the namenode write lock.
          AbstractDelegationTokenSecretManager grabs the namenode write lock and then tries to grab the safemode lock.

          Across all the refactoring, the order needs to be enforced: grab the namenode lock before grabbing the safemode lock.
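
          For illustration, here is a minimal, self-contained sketch of the two inverted acquisition orders described above. The class, field, and method names are simplified stand-ins, not the actual FSNamesystem/SafeModeInfo code:

          import java.util.concurrent.locks.ReentrantReadWriteLock;

          // Simplified stand-ins for the two lock-acquisition paths; not the real HDFS classes.
          class LockOrderSketch {
              private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true); // namesystem lock
              private final Object safeModeMonitor = new Object();                            // SafeModeInfo monitor

              // Startup path (main thread): SafeModeInfo monitor first, then namesystem write lock,
              // as in setBlockTotal() -> leave() -> initializeReplQueues() -> processMisReplicatedBlocks().
              void startupPath() {
                  synchronized (safeModeMonitor) {
                      fsLock.writeLock().lock();
                      try { /* scan blocks */ } finally { fsLock.writeLock().unlock(); }
                  }
              }

              // Token-manager path (ExpiredTokenRemover thread): write lock first, then the monitor,
              // as in logUpdateMasterKey() -> isInSafeMode() -> SafeModeInfo.isOn().
              void tokenManagerPath() {
                  fsLock.writeLock().lock();
                  try {
                      synchronized (safeModeMonitor) { /* check safe mode */ }
                  } finally { fsLock.writeLock().unlock(); }
              }
          }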

          Tsz Wo Nicholas Sze added a comment -

          Suresh, thanks for letting me know. I will take a look.

          Tsz Wo Nicholas Sze added a comment -

          In revision 1139090 (i.e. before any blockmanagement package refactoring),

          • SafeModeInfo.initializeReplQueues() holds safemode lock and then acquires namesystem write-lock in BlockManager.processMisReplicatedBlocks().
          • FSNamesystem.logUpdateMasterKey(..) holds write-lock and then acquires safemode lock in SafeModeInfo.isOn().

          So the deadlock may not be related to the refactoring.

          Tsz Wo Nicholas Sze added a comment -

          In HDFS-988, the following check was added:

             public void logUpdateMasterKey(DelegationKey key) throws IOException {
               writeLock();
               try {
          +      if (isInSafeMode()) {
          +        throw new SafeModeException(
          +          "Cannot log master key update in safe mode", safeMode);
          +      }
                 getEditLog().logUpdateMasterKey(key);
               } finally {
                 writeUnlock();
          

          It seems that this change introduced the deadlock.
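
          For context, a hedged sketch of why this check inverts the lock order: isInSafeMode() delegates to SafeModeInfo.isOn(), which is synchronized, so the added call requests the SafeModeInfo monitor while the write lock taken by logUpdateMasterKey(..) is already held. The bodies below are simplified placeholders, not the real implementation:

          // Simplified placeholders; only the locking shape matches the real code.
          class SafeModeCheckSketch {
              static class SafeModeInfo {
                  synchronized boolean isOn() {   // acquires the SafeModeInfo monitor
                      return true;                // placeholder; the real check inspects safe-mode state
                  }
              }

              private final SafeModeInfo safeMode = new SafeModeInfo();

              // Called from logUpdateMasterKey() while the FSNamesystem write lock is already held,
              // so the effective order here is: write lock -> SafeModeInfo monitor.
              boolean isInSafeMode() {
                  return safeMode != null && safeMode.isOn();
              }
          }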

          Tsz Wo Nicholas Sze added a comment -

          > Across all the refactoring, the order needs to be enforced: grab the namenode lock before grabbing the safemode lock.

          Hi Suresh, the goal of the refactoring was to move the code out of FSNamesystem. All the changes were straightforward and all lock orders were preserved (which is why I previously declined to fix locking problems as part of the refactoring). Unfortunately, we did see some locking problems.

          Tsz Wo Nicholas Sze added a comment -

          h2229_20110807.patch: it is unnecessary to hold the safemode lock when calling BlockManager.processMisReplicatedBlocks().
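
          A minimal sketch of that idea (names and fields are illustrative, not the actual patch): do the monitor-protected bookkeeping first, release the SafeModeInfo monitor, and only then call into the block manager, so the write lock is never requested while the monitor is held:

          import java.util.concurrent.locks.ReentrantReadWriteLock;

          // Illustrative only; field and method names do not match the real patch.
          class ReplQueueInitSketch {
              private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);
              private final Object safeModeMonitor = new Object();
              private boolean initializedReplQueues;

              void initializeReplQueues() {
                  synchronized (safeModeMonitor) {
                      initializedReplQueues = true;   // touch only SafeModeInfo state inside the monitor
                  }
                  processMisReplicatedBlocks();       // write lock is taken with the monitor released
              }

              private void processMisReplicatedBlocks() {
                  fsLock.writeLock().lock();
                  try { /* scan the blocks map */ } finally { fsLock.writeLock().unlock(); }
              }
          }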

          Todd Lipcon added a comment -

          I ran jcarder with the patch applied and it still detected some deadlocks; the cycles are attached here.

          Tsz Wo Nicholas Sze added a comment -

          Hi Todd, the detected deadlocks seem to be false alarms, since setBlockTotal() is only invoked in activate(..) before the lease monitor thread is started.

          BTW, are you running it against trunk? The line numbers are off.

          Todd Lipcon added a comment -

          jcarder will detect whether a lock order is inverted regardless of which threads are running; i.e., it can't tell that the order A>B is only used before a thread is started and B>A only after it is started. So it might be a false positive, but if it's possible to make the ordering always consistent, that would be better, no?

          I'm running against trunk from last Thursday or Friday; I haven't updated yet today.

          Tsz Wo Nicholas Sze added a comment -

          I agree that it is better to have a strict lock-acquisition order. How about we fix the real deadlock first, so that other people can use trunk for testing MapReduce or other systems?

          Todd Lipcon added a comment -

          I haven't seen this deadlock before, so I don't have any way of verifying that the patch here fixes the issue. Any idea why the MR cluster launches are triggering this issue reliably while the HDFS tests (which launch hundreds of miniclusters) are passing fine?

          Tsz Wo Nicholas Sze added a comment -

          Would it be related to the security settings? The DelegationTokenSecretManager is activated only if UserGroupInformation.isSecurityEnabled().

          Vinod Kumar Vavilapalli added a comment -

          > Any idea why the MR cluster launches are triggering this issue reliably while the HDFS tests (which launch hundreds of miniclusters) are passing fine?

          My setup is trunk code with security enabled on a real cluster (not single node).

          I haven't had time to run it again with the attached patch. Will report soon once I get around to it.

          Tsz Wo Nicholas Sze added a comment -

          The patch may not work. Some unit tests failed. I am not sure if they are related since these unit tests also failed in some other patches. I will try running the tests without any patch.

          Tsz Wo Nicholas Sze added a comment -

          Found out that the test failures were due to HDFS-2230. All tests except TestHDFSCLI passed on my machine after reverting HDFS-2230 locally.

          Todd Lipcon added a comment -

          Hi Nicholas. I'm not sure this fixes the issue; if you look at the stack trace, initializeReplQueues() is being called by leave(), which is also synchronized on SafeModeInfo.

          Arun C Murthy added a comment -

          Guys, if this is going to take longer, should we consider reverting the patch which introduced this bug? Do we know which patch that is?

          Todd Lipcon added a comment -

          It seems like FSNamesystem.activate should be write-locking itself before calling setBlockTotal, maybe? Then it will already have the write lock and be consistent with the usual order (FSN lock > SafeModeInfo lock). It also seems to make intuitive sense: activating the FSN is definitely going to be making state changes within it.

          Tsz Wo Nicholas Sze added a comment -

          > ... initializeReplQueues() is being called by leave(), which is also synchronized on SafeModeInfo.

          That is the case; see this comment.

          Tsz Wo Nicholas Sze added a comment -

          @Arun, it was introduced by HDFS-988; see this comment.

          Todd Lipcon added a comment -

          Rather than fully revert HDFS-988, I'd prefer to simply comment out the safemode check in logUpdateMasterKey, which I think will work around this issue. But I think my suggestion above about taking the write lock during activate is more correct.

          Tsz Wo Nicholas Sze added a comment -

          @Todd, it is probably okay to hold the write lock in activate(..).

          Let's change setBlockTotal in a separate issue, since it is not actually a bug.

          Tsz Wo Nicholas Sze added a comment -

          Oops, I missed your point, Todd. My patch may not fix the bug.

          Tsz Wo Nicholas Sze added a comment -

          h2229_20110810.patch: acquire the write lock in activate().
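
          Roughly, the approach is the one suggested above: activate() takes the namesystem write lock before touching safe mode, so the SafeModeInfo monitor is always acquired second and the order matches the rest of the code (write lock, then monitor). A simplified sketch, not the committed diff:

          import java.util.concurrent.locks.ReentrantReadWriteLock;

          // Simplified illustration of the fix; not the committed code.
          class ActivateSketch {
              private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);
              private final Object safeModeMonitor = new Object();

              void activate() {
                  fsLock.writeLock().lock();   // take the write lock before any safe-mode work
                  try {
                      setBlockTotal();         // the monitor is now always acquired after the write lock
                  } finally {
                      fsLock.writeLock().unlock();
                  }
              }

              private void setBlockTotal() {
                  synchronized (safeModeMonitor) {
                      /* update safe-mode block counts; may leave safe mode and init repl queues */
                  }
              }
          }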

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12490061/h2229_20110810.patch
          against trunk revision 1156424.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.blockmanagement.TestReplicationPolicy
          org.apache.hadoop.hdfs.server.namenode.TestCheckpoint
          org.apache.hadoop.hdfs.server.namenode.TestNNThroughputBenchmark
          org.apache.hadoop.hdfs.server.namenode.TestValidateConfigurationSettings
          org.apache.hadoop.hdfs.TestHDFSServerPorts

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1083//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/1083//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1083//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          +1. All of those failing tests have been failing since the common mavenization (I think Alejandro is working on it).

          I also re-ran the jcarder test and verified that the cycle is no longer listed.

          Tsz Wo Nicholas Sze added a comment -

          Thanks Todd for the review.

          I have committed this.

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #731 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/731/)
          HDFS-2229. Fix a deadlock in namenode by enforcing lock acquisition ordering.

          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156847
          Files :

          • /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hdfs/CHANGES.txt
          • /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #826 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/826/)
          HDFS-2229. Fix a deadlock in namenode by enforcing lock acquisition ordering.

          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156847
          Files :

          • /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hdfs/CHANGES.txt
          • /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java

            People

            • Assignee:
              Tsz Wo Nicholas Sze
              Reporter:
              Vinod Kumar Vavilapalli
            • Votes:
              0
              Watchers:
              9

              Dates

              • Created:
                Updated:
                Resolved:

                Development