Issue Details (XML | Word | Printable)

Key: HADOOP-4517
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Blocker Blocker
Assignee: Tsz Wo (Nicholas), SZE
Reporter: Christian Kunz
Votes: 0
Watchers: 3
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

unstable dfs when running jobs on 0.18.1

Created: 24/Oct/08 07:16 PM   Updated: 08/Jul/09 04:43 PM
Return to search
Component/s: None
Affects Version/s: 0.18.1
Fix Version/s: 0.18.2

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 4517_20081024.patch 2008-10-24 10:01 PM Tsz Wo (Nicholas), SZE 3 kB
Text File Licensed for inclusion in ASF works 4517_20081024b_0.18.patch 2008-10-24 10:27 PM Tsz Wo (Nicholas), SZE 3 kB
Text File Licensed for inclusion in ASF works 4517_20081024c_0.18.patch 2008-10-25 12:26 AM Tsz Wo (Nicholas), SZE 3 kB
Text File Licensed for inclusion in ASF works 4517_20081024d.patch 2008-10-25 05:27 PM Tsz Wo (Nicholas), SZE 3 kB
Text File Licensed for inclusion in ASF works 4517_20081024d_0.18.patch 2008-10-25 01:44 AM Tsz Wo (Nicholas), SZE 3 kB
Text File Licensed for inclusion in ASF works 4517_20081027.patch 2008-10-27 07:02 PM Tsz Wo (Nicholas), SZE 4 kB
Text File Licensed for inclusion in ASF works 4517_20081027_0.18.patch 2008-10-27 08:43 PM Tsz Wo (Nicholas), SZE 4 kB
File Licensed for inclusion in ASF works datanode.out 2008-10-24 07:18 PM Christian Kunz 301 kB
Environment:

Hadoop Flags: Reviewed
Resolution Date: 28/Oct/08 12:07 AM


 Description  « Hide
2 attempts of a job using 6000 maps, 1900 reduces

1.st attempt: failed during reduce phase after 22 hours with 31 dead datanodes most of which became unresponsive due to an exception; dfs lost blocks
2nd attempt: failed during map phase after 5 hours with 5 dead datanodes due to exception; dfs lost blocks responsible for job failure.

I will post typical datanode exception and attach thread dump.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Christian Kunz added a comment - 24/Oct/08 07:18 PM
Datanode Exception:
2008-10-23 17:30:21,703 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_4622591820056312866_6335826 received exception java.io.EOFException: while trying to read 65625 bytes
2008-10-23 17:30:21,703 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(xxx.yyy.zzz.uuu:50010, storageID=DS-1873593537-xx.yyy.zzz.uuu-vvvv-1216412543041, infoPort=50075, ipcPort=50020):DataXceiver: java.io.EOFException: while trying to read 65625 bytes
at org.apache.hadoop.dfs.DataNode$BlockReceiver.readToBuf(DataNode.java:2464)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.readNextPacket(DataNode.java:2508)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.receivePacket(DataNode.java:2572)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2698)
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1283)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
at java.lang.Thread.run(Thread.java:619)

Christian Kunz added a comment - 24/Oct/08 07:21 PM
Is it possible that this is a configuration issue?

Main concern is, of course, the loss of data blocks. So far they could be recovered by restarting the dead datanodes.


Raghu Angadi added a comment - 24/Oct/08 07:57 PM
When was the jstack you attached taken? It looks like following thread is blocking most of the other threads : {nofomat}
"IPC Server handler 8 on 50020" daemon prio=10 tid=0x085ca800 nid=0x6a81 in Object.wait() [0xb0d76000..0xb0d76e20]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1143)
- locked <0xb8be7008> (a org.apache.hadoop.util.Daemon)
at java.lang.Thread.join(Thread.java:1196)
at org.apache.hadoop.dfs.FSDataset.interruptOngoingCreates(FSDataset.java:777)
- locked <0xb64e9c20> (a org.apache.hadoop.dfs.FSDataset)
at org.apache.hadoop.dfs.FSDataset.updateBlock(FSDataset.java:795)
- locked <0xb64e9c20> (a org.apache.hadoop.dfs.FSDataset)
at org.apache.hadoop.dfs.DataNode.updateBlock(DataNode.java:3106)
at org.apache.hadoop.dfs.LeaseManager.syncBlock(LeaseManager.java:503)
at org.apache.hadoop.dfs.LeaseManager.recoverBlock(LeaseManager.java:471)
at org.apache.hadoop.dfs.DataNode.recoverBlock(DataNode.java:3134)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888){nofomat}

Not sure if this is the cause or a side effect. I wonder why there are so many blocks under recovery.


Christian Kunz added a comment - 24/Oct/08 08:27 PM
The thread dump was taken long time (about 10 hrs) after the last log message containing above exception for this datanode.

In general, from what I observed for the whole job there was an unsually high number of write errors including reduce task failures compared to 0.17.2. In the 12 hours leading up to the last exception there were 300+ exceptions like above on this datanode alone. I checked a datanode which did not become dead. It showed similar order of magnitude of exceptions.


Raghu Angadi added a comment - 24/Oct/08 08:40 PM
Seems like a deadlock in Datanode. This essentially freezes the datanode and eventually it will marked dead.

The deadlock is between the threads nid=0xf05 (A) and nid=0x6a81 (B). B is waiting for A to exit, but A is trying to acquire a lock held by B. It seems to be new append related code in 0.18. I don't think the exceptions are related.

  • A :
     "org.apache.hadoop.dfs.DataNode$DataXceiver@1ccd900" daemon prio=10 tid=0x09368000 nid=0xf05 waiting for monitor entry [0x9bb21000..0x9bb21ea0]
       java.lang.Thread.State: BLOCKED (on object monitor)
    	at org.apache.hadoop.dfs.FSDataset.setChannelPosition(FSDataset.java:977)
    	- waiting to lock <0xb64e9c20> (a org.apache.hadoop.dfs.FSDataset)
    	at org.apache.hadoop.dfs.DataNode$BlockReceiver.setBlockPosition(DataNode.java:2791)
    	at org.apache.hadoop.dfs.DataNode$BlockReceiver.receivePacket(DataNode.java:2596)
    	at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2698)
    	at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1283)
    	at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
    	at java.lang.Thread.run(Thread.java:619)
    
  • B :
    "org.apache.hadoop.dfs.DataNode$DataXceiver@1ccd900" daemon prio=10 tid=0x09368000 nid=0xf05 waiting for monitor entry [0x9bb21000..0x9bb21ea0]
       java.lang.Thread.State: BLOCKED (on object monitor)
    	at org.apache.hadoop.dfs.FSDataset.setChannelPosition(FSDataset.java:977)
    	- waiting to lock <0xb64e9c20> (a org.apache.hadoop.dfs.FSDataset)
    	at org.apache.hadoop.dfs.DataNode$BlockReceiver.setBlockPosition(DataNode.java:2791)
    	at org.apache.hadoop.dfs.DataNode$BlockReceiver.receivePacket(DataNode.java:2596)
    	at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2698)
    	at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1283)
    	at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
    	at java.lang.Thread.run(Thread.java:619)
    

Tsz Wo (Nicholas), SZE added a comment - 24/Oct/08 08:41 PM
> It looks like following thread is blocking most of the other threads ...

This seems a deadlock: Server handler 8 is holding the lock 0xb64e9c20 but waiting for some thread t to complete. I guess the thread t is blocked by same lock 0xb64e9c20.


Raghu Angadi added a comment - 24/Oct/08 08:45 PM
Yet another case that shows why interrupting a thread for thread communication is not recommended.

Raghu Angadi added a comment - 24/Oct/08 09:01 PM
I should note that thread 'A' above need not be nid=0xf05, it could be one of many other threads.

Tsz Wo (Nicholas), SZE added a comment - 24/Oct/08 10:01 PM
4517_20081024.patch: release the lock when calling join().

Tsz Wo (Nicholas), SZE added a comment - 24/Oct/08 10:27 PM
4517_20081024b_0.18.patch: for 0.18

Christian, could you try the patch and see whether it works?


Tsz Wo (Nicholas), SZE added a comment - 25/Oct/08 12:26 AM
4517_20081024c_0.18.patch: refactored the codes.

Tsz Wo (Nicholas), SZE added a comment - 25/Oct/08 01:44 AM
4517_20081024d_0.18.patch: fixed a bug.

Tsz Wo (Nicholas), SZE added a comment - 25/Oct/08 05:27 PM
4517_20081024d.patch: for trunk
     [exec] -1 overall.  

     [exec]     +1 @author.  The patch does not contain any @author tags.

     [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
     [exec]                         Please justify why no tests are needed for this patch.

     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.

     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.

     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

Passed all unit tests locally. Don't know how to add new tests for this.


Robert Chansler added a comment - 27/Oct/08 04:23 PM
Dhruba: Will you take a peek at this? Given the difficulty of doing a test, another person's inspection would be helpful!

Raghu Angadi added a comment - 27/Oct/08 05:50 PM

Looks good. Couple of rename suggestions :

  • {{ getOngoingCreatesOrUpdateBlock()}} could be called tryUpdateBlock(). I think JavaDoc should be better.
  • findAndUpdatedBlock() could be folded in to tryUpdateBlock().

I think it will be better to get a confirmation from Christian's load before committing it.. or at least before 0.18.2 is released.


dhruba borthakur added a comment - 27/Oct/08 06:45 PM
Had a offline conversation with Nicholas. The latest patch could run into a problem if there is a race with a new competing writer. The activeThreads cannot be cleared before joining them. Nicholas will probably put up a slightly modified patch.

Christian Kunz added a comment - 27/Oct/08 06:59 PM
Sorry about the late response.

As a matter of fact, before the patch was available (although it was done very fast, thank you, Nicholas), the 3rd attempt was successful because I restarted DataNodes periodically.
Our batch of jobs continued without the patch with no problem, indicating that the dead-lock situation is not easily reproducible. It seems to require a much higher number of simultaneous writing clients than Datanode servers.
The specific job doing this kind of intensive writing will be re-executed in about 3 days. Then I will try out the patch. Can you wait so long?


Tsz Wo (Nicholas), SZE added a comment - 27/Oct/08 07:02 PM
4517_20081027.patch:
  • only remove dead thread instead of all threads.
  • re-arranged the codes according to Raghu's comments.

Thanks Raghu and Dhruba for reviewing.


dhruba borthakur added a comment - 27/Oct/08 08:41 PM
This looks good. I do not know of a unit test than can trigger this race. It might be tough to trigger this race in a standalone unit test.

Tsz Wo (Nicholas), SZE added a comment - 27/Oct/08 08:43 PM
4517_20081027_0.18.patch: for 0.18

Raghu Angadi added a comment - 27/Oct/08 10:00 PM
+1.

Tsz Wo (Nicholas), SZE added a comment - 27/Oct/08 10:10 PM
     [exec] -1 overall.  

     [exec]     +1 @author.  The patch does not contain any @author tags.

     [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
     [exec]                         Please justify why no tests are needed for this patch.

     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.

     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.

     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

I am running unit tests locally. Will commit them once they passed the tests.


Tsz Wo (Nicholas), SZE added a comment - 28/Oct/08 12:07 AM
I just committed this.

Hudson added a comment - 28/Oct/08 01:56 PM
Integrated in Hadoop-trunk #645 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/645/)
. Release FSDataset lock before joining ongoing create threads. (szetszwo)