Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.3
    • Fix Version/s: 0.18.3
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      SafeModeInfo.leave() acquires locks in the wrong order, which can cause a deadlock.
      It first acquires the SafeModeInfo lock, then calls FSNamesystem.processMisReplicatedBlocks(), which requires the global FSNamesystem lock.
      Other code paths, such as block report processing, take the FSNamesystem lock first and then the SafeModeInfo lock.
      leave() should use the same order: first the FSNamesystem lock, then SafeModeInfo.
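The lock-ordering rule described above can be illustrated with a minimal Java sketch. This is not the Hadoop source: the class `LockOrderSketch`, the field `safeModeLock`, and the counter `blocksProcessed` are hypothetical stand-ins for FSNamesystem, the SafeModeInfo monitor, and the work done by processMisReplicatedBlocks(). Both public methods take the outer (namesystem) lock first and the safe-mode lock second, so no thread can hold one while waiting for the other in the opposite order.

```java
// Hypothetical stand-in for FSNamesystem; illustrative only, not the actual patch.
final class LockOrderSketch {
    // Plays the role of the SafeModeInfo monitor (assumed name).
    private final Object safeModeLock = new Object();
    private boolean safeModeOn = true;
    private int blocksProcessed = 0;

    // Requires the namesystem lock (the monitor of `this`),
    // as processMisReplicatedBlocks() does in the report.
    private synchronized void processMisReplicatedBlocks() {
        blocksProcessed++;
    }

    // Consistent order: namesystem lock first, then the safe-mode lock.
    public synchronized void leaveSafeMode() {
        synchronized (safeModeLock) {
            safeModeOn = false;
            // Java monitors are re-entrant, so this call does not block:
            // the current thread already holds the lock on `this`.
            processMisReplicatedBlocks();
        }
    }

    // A block-report-style path: same order as leaveSafeMode().
    public synchronized boolean isInSafeMode() {
        synchronized (safeModeLock) {
            return safeModeOn;
        }
    }

    public synchronized int blocksProcessed() {
        return blocksProcessed;
    }
}
```

If leaveSafeMode() instead took `safeModeLock` before the lock on `this`, a thread in isInSafeMode() could hold `this` while waiting for `safeModeLock`, which is the cycle described in this report.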

      1. safeModeDeadlock.patch
        1 kB
        Konstantin Shvachko
      2. safeModeDeadlock-0-18.patch
        1 kB
        Konstantin Shvachko
      3. safeModeDeadlock-0-18.patch
        1 kB
        Konstantin Shvachko

        Activity

        Konstantin Shvachko added a comment - - edited

        Thanks, Koji, for detecting this.
        Here is part of the jstack trace.

        "org.apache.hadoop.dfs.FSNamesystem$SafeModeMonitor@2b7f6b6d":
         at org.apache.hadoop.dfs.FSNamesystem.processMisReplicatedBlocks(FSNamesystem.java:2918)
         - waiting to lock <0x0000002ada38f558> (a org.apache.hadoop.dfs.FSNamesystem)
         at org.apache.hadoop.dfs.FSNamesystem.access$800(FSNamesystem.java:72)
         at org.apache.hadoop.dfs.FSNamesystem$SafeModeInfo.leave(FSNamesystem.java:3833)
         - locked <0x0000002d34fb4c80> (a org.apache.hadoop.dfs.FSNamesystem$SafeModeInfo)
         at org.apache.hadoop.dfs.FSNamesystem$SafeModeMonitor.run(FSNamesystem.java:4033)
         at java.lang.Thread.run(Thread.java:619)
        
        "IPC Server handler 38 on 8020":
         at org.apache.hadoop.dfs.FSNamesystem$SafeModeInfo.isOn(FSNamesystem.java:3796)
         - waiting to lock <0x0000002d34fb4c80> (a org.apache.hadoop.dfs.FSNamesystem$SafeModeInfo)
         at org.apache.hadoop.dfs.FSNamesystem.isInSafeMode(FSNamesystem.java:4068)
         at org.apache.hadoop.dfs.FSNamesystem.addStoredBlock(FSNamesystem.java:2820)
         - locked <0x0000002ada38f558> (a org.apache.hadoop.dfs.FSNamesystem)
         at org.apache.hadoop.dfs.FSNamesystem.processReport(FSNamesystem.java:2718)
         - locked <0x0000002ada38f558> (a org.apache.hadoop.dfs.FSNamesystem)
         at org.apache.hadoop.dfs.NameNode.blockReport(NameNode.java:613)
         at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:597)
         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
        
        Found 1 deadlock.
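The same kind of cycle that jstack reports above can also be detected from inside a JVM with the standard `java.lang.management` API. The following is a minimal sketch, not something used in this issue; the class name `DeadlockCheck` and the method `describeDeadlocks` are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper; illustrative only.
final class DeadlockCheck {
    /** Returns one line per deadlocked thread, or an empty array if none are found. */
    static String[] describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Returns null when no threads are deadlocked.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            return new String[0];
        }
        // Request locked monitors and synchronizers along with the thread info.
        ThreadInfo[] infos = mx.getThreadInfo(ids, true, true);
        String[] lines = new String[infos.length];
        for (int i = 0; i < infos.length; i++) {
            ThreadInfo info = infos[i];
            // An entry is null if the thread ended before its info was collected.
            lines[i] = (info == null)
                ? "(thread no longer alive)"
                : info.getThreadName() + " waiting to lock " + info.getLockName()
                    + " held by " + info.getLockOwnerName();
        }
        return lines;
    }
}
```

Each returned line names the waiting thread, the lock it wants, and the thread that owns that lock, which mirrors the "waiting to lock" and "locked" entries in the trace.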
        
        Konstantin Shvachko added a comment -

        This patch should solve the problem: leaveSafeMode() first acquires the FSNamesystem lock and then the SafeModeInfo lock.

        Konstantin Shvachko added a comment -

        Patch for 0.18

        Tsz Wo Nicholas Sze added a comment -

        +1 patch looks good.

             [exec] -1 overall.  
             [exec] 
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec] 
             [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
             [exec]                         Please justify why no tests are needed for this patch.
             [exec] 
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec] 
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec] 
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec] 
             [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
        
        Konstantin Shvachko added a comment -

        I ran the tests. Only one failed: TestMapReduceLocal. That failure is related to HADOOP-4907.

        Konstantin Shvachko added a comment -

        I just committed this.

        Hudson added a comment -

        Integrated in Hadoop-trunk #698 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/698/)

          People

          • Assignee:
            Konstantin Shvachko
            Reporter:
            Konstantin Shvachko
          • Votes:
            0
            Watchers:
            2
