|
bummer... transferTo does not handle non-blocking sockets well. I have a patch that works fine when the socket channel is in blocking mode. JavaDoc clearly indicates it can handle non-blocking sockets!. One hack is to interprete IOException and look for "Resource temporarily" at the beginning of the message. I get the following with non-blocking sockets : java.io.IOException: Resource temporarily unavailable
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:418)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:519)
at org.apache.hadoop.dfs.DataNode.transferToFully(DataNode.java:1500)
at org.apache.hadoop.dfs.DataNode.access$900(DataNode.java:84)
at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1807)
at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1880)
at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1032)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:961)
at java.lang.Thread.run(Thread.java:619)
Looks like all it needs to do is just to look for EAGAIN from the system call. 'mostly good' patch attacheed.
With the patch,initial tests show DataNode takes about 1/5th to 1/4th the CPU compared to trunk while reading data. That is about 10 times faster than 0.16. With the patch, for majority of the data the path changes from The main remaining issue is with non-blocking sockets, as mentioned in the previous comment. One option is to blocking sockets and have one thread that enforces write timeout. The last comment in
This patch does not make the socket blocking. Is the following hack to check for "EAGAIN" acceptable? The attached patch passes all the tests on Linux.
[...] try { nTransfered = (int) inChannel.transferTo(startPos, len, outChannel); } catch (IOException e) { /* at least jdk1.6.0 on Linux seems to throw IOException * when the socket is full. Hopefully near future verisions will * handle EAGAIN better. For now look for a specific string in for * the message for the exception. */ if (e.getMessage().startsWith("Resource temporarily unavailable")) { out.waitForWritable(); continue; } else { throw e; } } The IOException message could be different on other systems. For now, we could use transferTo() only on the systems where we know the text of the message. I think this issue will be fixed in next versions of JRE. On windows, transferTo() returns 0 as expected. This is what I am planning to do.
Here, out is a o.a.h.ipc.SocketOutputStream. I think using the real socket channel for Linux is worth the CPU savings. > transferTo does not handle non-blocking sockets well.
Looks like this is fixed in Java SE 7 (Dolphin) http://bugs.sun.com/view_bug.do?bug_id=5103988 Edit: corrected url for the bug. The following table shows 'dfs -cat' of 10Gb of data. This is a disk bound test and CPU is measured from /proc/pid/stat. io.file.buffer.size is 128k. This is cluster with single datanode and the client and datanode are on the same machine. The three fields reported for each run are user cpu, kernel cpu, and wall clock time. "total cpu" is sum of user and kernel cpu for DataNode process.
This shows DataNode takes about 80% less cpu. Also, since we don't actually allocate any user buffer, we could actually invoke transferTo() to send even larger amounts of data at a time. I haven't experimented with larger buffer sizes.
I will update the patch. I tried enabling this on Mac OS X 10.5.2 (added a comparison for Mac OS X). Throughput dropped by 8x and CPU only dropped by 2x. My test was just cat'ing a 13G file out of a DataNode with 2 disks:
hadoop -fs cat wikipedia.xml > /dev/null Without the patch I can get around 110-120MB/s (about the average speed of the two disks) while with the patch enabled I get around 16MB/s. Updated patch includes feedback from Konstantin. Also improves documentation.
Regd Sam's experiment on Mac OS : Same result. Each read is ~4k and makes it go quite slow.
Tried Java 5 & 6 for good measure. Thanks Sam. I am assuming you ran with large value (like 64 KB or 128KB) for io.file.buffer.size. Given this, options I can think of are :
I hope last two options are ruled out. My preference is 2nd option. Every option has (obvious) pros and cons. One isolated and well commented check for OS is not such a terrible thing (well, may be it is). Hadoop already has those.. its not the first such check. It does not mean we are in favor of such checks. Do I need to make a better case? Votes are welcome. Tested this with io.file.buffer.size set to 128K instead of the default 4K and got great results. Over 50% reduction in CPU time used by the DataNode. We can't have the default be on if the default buffer size is small.
It would be nice if we can keep code same on all platforms as much as possible. One option would be to raise the default io.file.buffer.size to something like 64K and use FileChannel.transferTo on all platforms.
Not all OSes are so bad. Linux does not suffer from such a problem. Not sure whether it is Java or Mac OS that is some how prohibiting read ahead for these sequential reads.
So we could enable this on datanode as long as the buffer size is >= 64k. Of course, Linux with smaller buffer sizes will also suffer. So on Linux with 4k buffersizes you see strong performance with this patch? Needing to raise the buffersize is limited to only Mac? It is hard to tell from your previous comment.
Yes. On Linux there is not much difference between the two cases with buffer size set to 4k (based on my preliminary tests).. iostat looks pretty much the same (size of each read is large) with and without the patch.
Yeah, not clear where the fault is on the Mac side. WIthout this patch there isn't much difference in performance with the two buffer sizes so it must be related to the NIO implementation in some way.
The point of the patch is to bypass the buffer. So making the buffer big doesn't improve the utility of transferTo(), rather just hides the fact that the implemention of transferTo() on the Mac sucks. It thus makes no sense to gate the use of transferTo() on the bufferSize.
Also, increasing the default buffer size can have a significant impact on memory usage, since some applications open lots of streams, and each open stream frequently has several buffers. When we find that a particular operation benefits from a larger buffer, it is usually best to increase just its buffer size rather than the default buffer size for all Hadoop i/o streams. I think option (2) still looks best. I agree Doug. I retested the pre-patch code on the Mac with the increased buffer size and it is faster and uses the same amount of CPU as the patched code does. Looks like no benefit on Mac OS X with the Apple VMs anyway. I haven't tested Soylatte.
Yes. I was not planning to increase default buffer size in this patch. Currently I am looking into a dip in DFSIO read test. I will upload a new version of the patch after fixing it.
Buffer size might matter on Linux also to certain extent (may be 15 - 20%) between 4k and 128k. I could not reproduce this on my deb box, where the disk maxes out at 30-35 MB/s, but DFSIO seems to show. TestDFSIO essentially tests (bursty) write and read of a 320MB file. Each mapper writes or reads such a file reports time taken for this IO. With the patch results showed 10-20% dip in DFSIO (smaller the cluster, larger difference). To avoid misconfiguration, I tried a path with hard coded 128KB buffer size while sending block and results came back to normal. Once we confirm misconfiguration, we can consider a buffer size "cut off" for transferTo(). With transferTo(), DataNode does not actually allocate the buffer. In that sense, we could increase the size in DataNode without affecting client buffering (apart from slight increase in buffer for checksum).
I mean, DataNode could use some thing like max(64KB, configured buffer size) when transfer to is enabled. 64KB implies 512 bytes of checksum data. So client needs to read 512 bytes of checksum before it reads actual data. Just confirmed that the DFSIO test in fact ran with a buffer size of 4096 for DataNode (not sure about the client). Given that smaller buffer size for transferTo() could negatively impact DataNode reads, any preference for the following options (when transferTo() is enabled by config) ?:
My first choice : #2, second : #3. 'cut off' could be something like 128KB. Changes in the updated patch :
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12380800/HADOOP-3164.patch against trunk revision 645773. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 3 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2314/testReport/ This message is automatically generated. I tested reading a 2GB file on a small windows box. Test takes same amount of wall clock time (5-6min) in both cases. Didn't actually check CPU.
+1. Code looks good.
I ran tests on Windows and Linux in order to verify that there is no degradation in performance. Pre-patch and post-patch numbers on both systems are comparable. Integrated in Hadoop-trunk #470 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/470/
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
This essentially means we need to implement write-timeout ourselves instead of using wait implementation in SocketOutputStream. May be SocketOutputStream could support something like waitForRead().