
Key: HADOOP-4904
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Blocker
Assignee: Konstantin Shvachko
Reporter: Konstantin Shvachko
Votes: 0
Watchers: 3

Hadoop Common

Deadlock while leaving safe mode.

Created: 17/Dec/08 05:20 PM   Updated: 08/Jul/09 04:43 PM
Component/s: None
Affects Version/s: 0.18.3
Fix Version/s: 0.18.3

Time Tracking:
Not Specified

File Attachments (text files, licensed for inclusion in ASF works):
  Name                         Date                 Author               Size
  safeModeDeadlock-0-18.patch  2008-12-18 01:01 AM  Konstantin Shvachko  1 kB
  safeModeDeadlock-0-18.patch  2008-12-18 12:57 AM  Konstantin Shvachko  1 kB
  safeModeDeadlock.patch       2008-12-17 10:45 PM  Konstantin Shvachko  1 kB

Hadoop Flags: Reviewed
Resolution Date: 19/Dec/08 02:38 AM


Description
SafeModeInfo.leave() acquires locks in the wrong order, which causes a deadlock.
It first acquires the SafeModeInfo lock, then calls FSNamesystem.processMisReplicatedBlocks(), which requires the global FSNamesystem lock.
It should be the other way around: first the FSNamesystem lock, then the SafeModeInfo lock.
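
The intended ordering can be illustrated with a minimal sketch. This is not the Hadoop code and not the attached patch: the class LockOrderSketch, its two lock objects, and runConcurrently() are hypothetical stand-ins for the FSNamesystem and SafeModeInfo monitors. Both paths take the "namesystem" lock first and the "safe mode" lock second, so neither can hold one lock while waiting for the other in the opposite order.

```java
/** Hypothetical stand-in for the two monitors in this issue; not Hadoop code. */
public class LockOrderSketch {
    private final Object namesystemLock = new Object(); // role of the FSNamesystem monitor
    private final Object safeModeLock = new Object();   // role of the SafeModeInfo monitor
    private boolean safeModeOn = true;
    private long storedBlocks = 0;

    /** Block-report path: namesystem lock first, then the safe-mode lock. */
    public void addStoredBlock() {
        synchronized (namesystemLock) {
            storedBlocks++;
            synchronized (safeModeLock) {
                if (safeModeOn) {
                    // safe-mode bookkeeping would go here
                }
            }
        }
    }

    /** Leave path: same order as addStoredBlock(), so the nested call is safe. */
    public void leaveSafeMode() {
        synchronized (namesystemLock) {
            synchronized (safeModeLock) {
                safeModeOn = false;
                processMisReplicatedBlocks(); // re-enters namesystemLock; Java monitors are reentrant
            }
        }
    }

    private void processMisReplicatedBlocks() {
        synchronized (namesystemLock) {
            // placeholder for work that needs the global lock
        }
    }

    public boolean isInSafeMode() {
        synchronized (safeModeLock) {
            return safeModeOn;
        }
    }

    /** Runs both paths concurrently; returns true if both threads finish within the timeout. */
    public static boolean runConcurrently(int iterations) {
        LockOrderSketch ns = new LockOrderSketch();
        Thread reporter = new Thread(() -> {
            for (int i = 0; i < iterations; i++) ns.addStoredBlock();
        });
        Thread monitor = new Thread(() -> {
            for (int i = 0; i < iterations; i++) ns.leaveSafeMode();
        });
        reporter.setDaemon(true);
        monitor.setDaemon(true);
        reporter.start();
        monitor.start();
        try {
            reporter.join(10_000);
            monitor.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !reporter.isAlive() && !monitor.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("finished without deadlock: " + runConcurrently(100_000));
    }
}
```

If leaveSafeMode() took safeModeLock before namesystemLock, the two threads could each hold one lock while waiting for the other, which is the cycle reported here.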

Comments
Konstantin Shvachko added a comment - 17/Dec/08 05:25 PM - edited
Thanks Koji for detecting this.
Here's part of the jstack trace.
"org.apache.hadoop.dfs.FSNamesystem$SafeModeMonitor@2b7f6b6d":
 at org.apache.hadoop.dfs.FSNamesystem.processMisReplicatedBlocks(FSNamesystem.java:2918)
 - waiting to lock <0x0000002ada38f558> (a org.apache.hadoop.dfs.FSNamesystem)
 at org.apache.hadoop.dfs.FSNamesystem.access$800(FSNamesystem.java:72)
 at org.apache.hadoop.dfs.FSNamesystem$SafeModeInfo.leave(FSNamesystem.java:3833)
 - locked <0x0000002d34fb4c80> (a org.apache.hadoop.dfs.FSNamesystem$SafeModeInfo)
 at org.apache.hadoop.dfs.FSNamesystem$SafeModeMonitor.run(FSNamesystem.java:4033)
 at java.lang.Thread.run(Thread.java:619)

"IPC Server handler 38 on 8020":
 at org.apache.hadoop.dfs.FSNamesystem$SafeModeInfo.isOn(FSNamesystem.java:3796)
 - waiting to lock <0x0000002d34fb4c80> (a org.apache.hadoop.dfs.FSNamesystem$SafeModeInfo)
 at org.apache.hadoop.dfs.FSNamesystem.isInSafeMode(FSNamesystem.java:4068)
 at org.apache.hadoop.dfs.FSNamesystem.addStoredBlock(FSNamesystem.java:2820)
 - locked <0x0000002ada38f558> (a org.apache.hadoop.dfs.FSNamesystem)
 at org.apache.hadoop.dfs.FSNamesystem.processReport(FSNamesystem.java:2718)
 - locked <0x0000002ada38f558> (a org.apache.hadoop.dfs.FSNamesystem)
 at org.apache.hadoop.dfs.NameNode.blockReport(NameNode.java:613)
 at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

Found 1 deadlock.
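
For reference, the same kind of monitor deadlock that jstack reports can also be queried from inside a JVM through the standard java.lang.management API. The sketch below is an illustration only; the class name DeadlockProbe and the report format are hypothetical and are not part of Hadoop or of this patch.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists threads that are deadlocked on object monitors. */
public class DeadlockProbe {
    /** Returns one line per deadlocked thread, or an empty list if there is no deadlock. */
    public static List<String> describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findMonitorDeadlockedThreads(); // null when no deadlock is found
        List<String> report = new ArrayList<>();
        if (ids == null) {
            return report;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info == null) {
                continue; // thread terminated between the two calls
            }
            report.add("\"" + info.getThreadName() + "\" waiting to lock "
                    + info.getLockName() + " held by \"" + info.getLockOwnerName() + "\"");
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> report = describeDeadlocks();
        System.out.println("Found " + report.size() + " deadlocked thread(s).");
        for (String line : report) {
            System.out.println(line);
        }
    }
}
```

Applied to the trace above, such a probe would be expected to list the SafeModeMonitor thread and the IPC handler thread, each waiting on the monitor the other holds.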

Konstantin Shvachko added a comment - 17/Dec/08 10:45 PM
This should solve the problem. leaveSafeMode() now acquires the FSNamesystem lock first and then the SafeModeInfo lock.

Konstantin Shvachko added a comment - 18/Dec/08 12:57 AM
Patch for 0.18

Tsz Wo (Nicholas), SZE added a comment - 19/Dec/08 12:44 AM
+1 patch looks good.
     [exec] -1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
     [exec]                         Please justify why no tests are needed for this patch.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

Konstantin Shvachko added a comment - 19/Dec/08 12:49 AM
I ran tests. Only one failed: TestMapReduceLocal. This is related to HADOOP-4907.

Konstantin Shvachko added a comment - 19/Dec/08 02:38 AM
I just committed this.

Hudson added a comment - 22/Dec/08 03:15 PM