|
Stacktrace of a datanode that lost contact with the namenode. Not sure it helps but doesn't hurt.
Found 68 of these when doing a stackdump in a datanode that had lost contact with the namenode.
"org.apache.hadoop.dfs.DataNode$DataXceiver@5b02a6" daemon prio=10 tid=0x7192bc00 nid=0x5bc6 waiting for monitor entry [0x70cee000..0x70cef0c0]
> Found 68 of these when doing a stackdump in a datanode that had lost contact with the namenode.
Could you attach the full stack trace of this datanode? While you are at it, could you attach log (.log file) for this datanode as well? The log file show what activity is going on now. Also list approximate time the stack trace was taken. My observation is that you are writing a lot of blocks.. and the datanode that looks blocked is blocked while listing all the blocks on the native filesystem. It does this every hour when it sends block reports. Till now nothing suspicious other than heavy write traffic and slow disks. Check iostat on the machine. What is the hardware like? The two main threads from the DataNode are :
Is is there any change in load on the cluster compared to when it was running 0.15?
Here's the log file for that machine.
Hardware: 8 cores @ 1.86ghz, 8gb ram, 4x750gb 7200 prm disks with gigabit ethernet connections. Example iostat from a node that just lost contact with the namenode. avg-cpu: %user %nice %sys %iowait %idle Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util Johan, you can enclose the formatted text inside {noformat} ... {noformat} so that it is easier to read.
What is the exact iostat command you ran? Is it an average over last 5 seconds or so or overall average?
From the data 2008-04-10 09:23:08,667 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 497572 blocks got processed in 381086 msecs A few things about this:
Though DataNode should be able to handle larger number of blocks better, I don't think this is a blocker for 0.16.3 release (unless iostat shows something else). Do you agree? Btw, are you planning to have large number of small blocks? Its going to limit NameNode scalability. I am currently marking this for 0.18. 0.16.3 is close to be released and looks like any fix for this would not be trivial. Also it does not seem like a regression yet. But if it turns out to be different, we can make this a candidate for 0.16.4 or 0.17.
Here's the second output from iostat -x 30 from a datanode that just lost contact.
avg-cpu: %user %nice %sys %iowait %idle
94.26 0.00 0.83 0.00 4.91
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.40 0.20 137.45 5.10 2031.86 2680.44 1015.93 1340.22 33.06 1.93 13.53 6.72 95.73
sdb 0.23 0.03 0.33 1.20 145.02 552.22 72.51 276.11 454.87 0.02 10.22 4.78 0.73
sdc 0.50 0.03 0.70 8.76 307.10 7137.72 153.55 3568.86 786.69 4.70 496.94 6.90 6.53
sdd 0.47 0.07 0.83 0.53 315.63 4.83 157.81 2.42 234.56 0.06 44.39 10.73 1.47
I'm aware that we have quite a lot of small files, it's an issue we're working on. I guess we'll have to ramp up the priority. Perhaps the difference between 0.15 and 0.16 was just that we did hit it quite hard after the upgrade as data had queued up, We've reduced the number of blocks per node to about 150 000, but we're still seeing the same problem as before.
can you check messages similar to "[...]BlockReport of 497572 blocks got processed in 381086 msecs" on such data nodes?
One thing I am not sure yet is why your sda is so much busier than other disks.. somehow all the native filesystem metadata falls on one disk? Another possibility is swapping but size of each is very small (~512 bytes).
> "DataNode: [/var/storage/1/dfs/data,/var/storage/2/dfs/data,/var/storage/3/dfs/data,/var/storage/4/dfs/data]"
is each of /var/storage/[1-4] mounted on different disk or all are on sda? All those directories are on separate disks.
One thing I didn't think about was that even though we merged a lot of files to reduce the number of blocks there would still be tons of files in dfs/data/previous, since I had not run -finalizeUpgrade. Although having the datanodes lose contact with the namenode because it's checking disk usage seems like quite a serious bug to me. not sure why sda is more busy, although that is where the logs are located Yes, looks like DU goes through previous also... So instead of both block report and DU causing problems now it is DU..
Can you clarify if "datanodes lose contact" means NameNode actually marks them "dead"? > Although having the datanodes lose contact with the namenode because it's checking disk usage seems like quite a serious bug to me. I agree. Doing these in the background without blocking normal DataNode functions takes a little bit of restructuring. We should keep this jira open. > not sure why sda is more busy, although that is where the logs are located Created a simple patch that makes DU run the shell command in a new thread and never block on getUsage(). It does change the behavior a bit, but in most cases it shouldn't be a problem.
I've not had a chance to test this on a real cluster yet, works in my local testing though. Unfortunately it turns out this only solves part of the problem. Block reports are still an issue. Regd block reports, one option is not to do the fs scan at all.
> Block reports are still an issue.
Will incremental block reports (HADOOP-1079) resolve this? > Will incremental block reports (HADOOP-1079) resolve this?
If it avoids scanning the file system, yes. From the above jira it is not clear if it will avoid comparing in-memory map and on-disk blocks. Even as simple a change as commenting out fs scan is enough for this problem, I think. If I understand this correctly the purpose of a block report is to let the namenode know if a block mysteriously goes missing on the data node.
Would it make sense to have this as part of the block verification mechanism described in Of course just commenting out the fs scan as Raghu suggest would also work, how big is the issue of blocks going missing? Shall we fix the DU issue first, it seems easier. Could someone review my patch, please?
Then we can create another ticket for the block report problem. regd the patch,
> It does change the behavior a bit, but in most cases it shouldn't be a problem. I haven't looked at it properly yet, could you describe what the change in behavior is? Also not sure why it needs to change Shell stuff. Could the desired behavor for DF be implemented in DF class? > Also not sure why it needs to change Shell stuff. [ ... ]
I think this was because Johan wanted to make DF implement Runnable, and there was a conflict, since Shell already has a method named 'run'. But this changes public APIs incompatibly, and is thus not a good approach. Johan, perhaps instead we could define a nested class in DF.java that extends Thread and overrides run() there, or implements Runnable, if you prefer. Also the default interval is dfs.blockreport.intervalMsec, which seems rather long. It should really be related to the heartbeat interval, no? Moreover, we shouldn't use a DFS parameter in a generic FS class. So default interval should be something safe, perhaps hardwired to 10 minutes or somesuch, and DFS should override that when it constructs a DF, if it needs. Does that make sense? And perhaps the thread should run 'df' first, then sleep, so that values are available to clients sooner? Finally, and most imporant, what evidence do you have that DU is in fact causing problems? It is run very infrequently, not strictly synchronized with block reports but rather triggered by heartbeats. A slow DU would thus result in a delayed heartbeat. None of the stack traces above indicate that DU is blocking other activities of the datanode. Cleaned up the patch a bit, now only changes DU.java + the test
Doug: You're right about the Runnable/run() bit, just as you wrote that I adapted the patch as you suggested.
I agree about the interval, I'll change it. In the DU constructor the command is run once so that we get values straight away, I thought this would be better since then we know for sure there's correct values in there once the object is created. I'll try to recreate the situation to produce good evidence, but off the top of my head DU is used to decide what volume to write to in writeToBlock in FSDataset, so it causes problems with writing blocks if it takes too long. We've seen quite a lot of this. This addresses the first of my above concerns, but not the other three.
Also, does that test in fact succeed? It seems to me that the thread will not have yet had a chance to update its usage count before the assertEquals(), no? Oops. I posted my previous comment before I saw your last comment. You're right, I was confusing DU and DF. And I missed the refresh in the ctor. Sorry! That resolves most of my concerns. Since this is called much more frequently than I was thinking, it makes sense for it not to be synchronous. I think my only remaining concern is the default interval, which you've said you'd address. Thanks!
It was my mistake to say 'DF' where I meant 'DU'. Patch looks good. Couple of comments :
Are you using this (or prev) patch in your environment? This certainly improves DN stability with large number of blocks. We still need to keep in mind that DU has very noticeable penalty. Say it takes around 10min (as in your case)... then it implies 15% of the time DN will be extremely I/O starved. This will have very noticeable affect on I/O intensive applications. Updated patch with the suggestions from Doug and Raghu.
Interval defaults to 10min. If the incoming interval is 0 the previous behavior is used. Findbugs doesn't like that I start a thread in the constructor, but afaik it's the only way without adding a start method to the class and I assume you don't want to change the interface. Raghu: I am using a previous patch on our cluster yes, no problems so far. > I'm not sure what you mean by not having a permanent thread. How would we update the value without blocking on getUsed in that case?
This thread will stay idle pretty much most of the time. So we could start a thread inside getUsed() (and possibly in other accessor methods if interval has passed) and make the thread exit after running du. This is no less accurate than current implementation. Or you could schedule a periodic thread using Java 'Executor'. Even if you do keep the persistent thread, could you add comment if you agree that it need not be persistent.. we might implement that later. Regd the patch:
1. Changed
2. Removed the reference to DU and call DU.this.run() instead. Scheduling it with an Executor would still have a running thread that keeps track of the scheduling, no? ScheduledThreadPoolExecutor for example. Anyway, I added the comment about improving this with a non permanent thread anyway, in the most common case I guess that would work fine. As mentioned this patch will cause one new findbugs error, starting a thread in the constructor, hard to avoid without having a new method to start it, breaking the public interface. > Scheduling it with an Executor would still have a running thread that keeps track of the scheduling, no? DU looks like simple utility to users. If they do create these multiple times, each when ever they need, they will be surprised to see threads hanging around. In that sense, it might be better to add a "start()" method so that it is explicit to them that a thread might be started and it needs to be shutdown(). +1 over all. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12382039/du-nonblocking-v5-trunk.patch against trunk revision 656270. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to cause Findbugs to fail. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2467/testReport/ This message is automatically generated. Actually requiring start() is not so bad. Also DataNodes could invoke shutdown() when in side its shutdown. It will also remove the findbugs warning.
Patch failed, my bad, ran all my tests with java6 so didn't catch the the @Overrides that eclipse likes to put in. It doesn't work well with java5.
Raghu: Sure, I can add a start method if you guys want one. Updated patch with start and shutdown methods. If start isn't invoked the previous behavior of running on demand will be executed.
I'm having issues running the unit tests on my machine on a clean trunk, TestDatanodeBlockScanner fails, so I hope this one passes all the tests. > TestDatanodeBlockScanner fails,..
Please update your trunk, a fix for this was committed yesterday. If it still fails on your machine, let us know. +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12382371/du-nonblocking-v6-trunk.patch against trunk revision 658862. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2512/testReport/ This message is automatically generated. +1. Thanks for multiple iterations. If a user does not call start(), the interval given will be silently ignored.. in that sense, interval could be argument to start(). But I will commit v6.path for now.
I just committed this. Thanks Johan!
Integrated in Hadoop-trunk #500 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/500/
I propose an alternate solution for this.
If the block information was managed by having a inotify task (in linux/solaris), and the windows equivalent which I forget, the datanode could be informed each time a file in the dfs tree is created, updated, or deleted. With this information being delivered, it can maintain an accurate block map with only 1 full scan of the datanode blocks, at start time. With this algorithm the data nodes will be able to scale to a much larger number of blocks. The other thing is the way the sync blocks on the FSDataset.FSVolumeSet are held totally aggravates this bug in 0.18.1. The jason@attributor.com address will be going away shortly, I will be switching to jason.hadoop@gmail.com in the next little bit. Sounds very interesting. I suggest opening a new ticket describing the solution in more detail so the discussion can continue there instead of in this closed ticket.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
08/04/10 10:09:27 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.0.5.76:50010
08/04/10 10:09:27 INFO dfs.DFSClient: Abandoning block blk_-5192866954303337577
08/04/10 10:09:27 INFO dfs.DFSClient: Waiting to find target node: 10.0.5.70:50010
Datanode log output from around the same time:
2008-04-10 10:09:15,758 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 got response for connect ack from downstream datanode with firstbadlink as 10.0.5.76:50010
2008-04-10 10:09:15,758 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 10.0.5.76:50010
2008-04-10 10:09:15,758 INFO org.apache.hadoop.dfs.DataNode: PacketResponder blk_-9149681676832791404 2 Exception java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at java.io.DataInputStream.readLong(DataInputStream.java:399)
at org.apache.hadoop.dfs.DataNode$PacketResponder.run(DataNode.java:1822)
at java.lang.Thread.run(Thread.java:619)
2008-04-10 10:09:15,758 INFO org.apache.hadoop.dfs.DataNode: PacketResponder 2 for block blk_-9149681676832791404 terminating
2008-04-10 10:09:16,793 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_1511572447827516117 received exception java.net.SocketTimeoutException: Read timed out
2008-04-10 10:09:16,793 ERROR org.apache.hadoop.dfs.DataNode: 10.0.5.70:50010:DataXceiver: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.net.SocketInputStream.read(SocketInputStream.java:182)
at java.io.DataInputStream.readByte(DataInputStream.java:248)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:324)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:346)
at org.apache.hadoop.io.Text.readString(Text.java:413)
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1117)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
at java.lang.Thread.run(Thread.java:619)
2008-04-10 10:09:23,895 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_-7442302015902809712 src: /10.0.5.74:55546 dest: /10.0.5.74:50010
2008-04-10 10:09:23,896 INFO org.apache.hadoop.dfs.DataNode: Datanode 0 forwarding connect ack to upstream firstbadlink is
2008-04-10 10:09:23,927 INFO org.apache.hadoop.dfs.DataNode: Received block blk_-7442302015902809712 of size 902 from /10.0.5.74
2008-04-10 10:09:23,928 INFO org.apache.hadoop.dfs.DataNode: PacketResponder 0 for block blk_-7442302015902809712 terminating
2008-04-10 10:09:23,937 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_4161972554165500020 src: /10.0.11.7:41256 dest: /10.0.11.7:50010
2008-04-10 10:09:23,968 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 got response for connect ack from downstream datanode with firstbadlink as
2008-04-10 10:09:23,969 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is
2008-04-10 10:09:24,034 INFO org.apache.hadoop.dfs.DataNode: Received block blk_4161972554165500020 of size 95 from /10.0.11.7
2008-04-10 10:09:24,034 INFO org.apache.hadoop.dfs.DataNode: PacketResponder 2 for block blk_4161972554165500020 terminating
2008-04-10 10:09:27,664 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 got response for connect ack from downstream datanode with firstbadlink as 10.0.5.76:50010
2008-04-10 10:09:27,664 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 10.0.5.76:50010
2008-04-10 10:09:27,665 INFO org.apache.hadoop.dfs.DataNode: PacketResponder blk_-5192866954303337577 2 Exception java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at java.io.DataInputStream.readLong(DataInputStream.java:399)
at org.apache.hadoop.dfs.DataNode$PacketResponder.run(DataNode.java:1822)
at java.lang.Thread.run(Thread.java:619)