We have been looking closely at the capability introduced in this Jira because the initial results look very promising. However, after digging deeper, I’m not convinced this approach makes the most sense at this time. This Jira is all about getting maximum performance when the blocks of a file are on the local node. Obviously, the performance of this use case is a critical piece of “move computation to data”. However, if going through the datanode offered the same level of performance as going directly at the files, this Jira wouldn’t even exist. So, I think it’s really important for us to understand the performance benefits of going direct and the real root causes of any performance differences between going direct and having the data flow through the datanode. Once that is well understood, then I think we could look at the value proposition of this change. We’ve tried to do some of this analysis and the results follow. The key word here is “some”: I feel we’ve gathered enough data to draw some valuable conclusions, but I don’t think it’s enough data to say this type of approach wouldn’t be worth pursuing down the road.
For the impatient, the paragraphs below can be summarized with the following points:
+ Going through the datanode maintains architectural layering. All other things being equal, it would be best to avoid exposing the internal details of how the datanode maintains its data. Violations of this layering could paint us into a corner down the road and therefore should be avoided.
+ Benchmarked localhost sockets at 650MB/sec (write->read) and 1.6GB/sec (sendfile->read). nc uses 1K buffers, which probably explains the low bandwidth observed as part of this Jira.
+ Measured maximum client ingest rate at 280MB/sec for sockets. Checksum calculation seems to account for a big part of this limit.
+ Measured maximum datanode streaming output rate of 827MB/sec.
+ Measured maximum datanode random read output rate of 221MB/sec (with HDFS-941).
+ The maximum client ingest rate of 280MB/sec is significantly slower than the maximum datanode streaming output rate of 827MB/sec and only marginally faster than the maximum datanode random output rate of 221MB/sec. This seems to say that with the current bottlenecks, there isn’t a ton of performance to be gained from going direct, at least not for the simple test cases used here.
For the detail-oriented, keep reading.
If everything in the system were optimized, then going direct would certainly have a performance advantage (fewer layers mean higher top-end performance). However, the questions are:
+ How much of a performance gain?
+ Can this gain be realized with existing use cases?
+ Is the gain worth the layering violations? For example, what if we decided to automatically merge small blks into single files? To access this data directly, both the datanode and the client-side code would have to be cognizant of this format. Or what if we wanted to support encrypted content? Or to handle I/O errors differently than they’re handled today? I’m sure there are others I’m not thinking of.
Ok, now for some data.
One of the initial comments talked about the overhead of localhost network connections. The comment used nc to measure bandwidth through a socket vs. bandwidth through a pipe. We looked into this a little because it was a bit surprising. Sure enough, on my RHEL5 system I saw pretty much the same numbers. Digging deeper, nc uses a 1K buffer on RHEL5, and this can’t be good for throughput. So, we ran lmbench on the same system to see what sort of results we would get. localhost sockets and pipes both came in right around 660MB/sec with 64K blocksizes. Pipes will probably scale up a bit better across more cores, but I would not expect to see the 5x difference the original nc experiment showed. We also modified lmbench to use sendfile() instead of write() in the local socket test and measured this throughput at 1.6GB/sec.
CONCLUSION: A localhost socket should be able to move around 650MB/sec for write->read, and 1.6GB/sec for sendfile->read.
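To make this concrete, here is a minimal sketch of the kind of loopback benchmark we are describing (the class name, sizes, and argument handling are ours for illustration, not lmbench itself). Running it with a 1K buffer vs. a 64K buffer should reproduce the gap between the original nc numbers and the lmbench numbers:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class LoopbackBench {
  static final long TOTAL_BYTES = 1L << 30;  // move 1GB through the socket

  public static void main(String[] args) throws Exception {
    final int bufSize = args.length > 0 ? Integer.parseInt(args[0]) : 64 * 1024;
    final ServerSocket server = new ServerSocket(0);

    // Reader side: drain the socket as fast as possible and discard.
    Thread reader = new Thread(() -> {
      try (Socket s = server.accept(); InputStream in = s.getInputStream()) {
        byte[] buf = new byte[bufSize];
        while (in.read(buf) >= 0) { /* discard */ }
      } catch (Exception e) { throw new RuntimeException(e); }
    });
    reader.start();

    // Writer side: push TOTAL_BYTES through localhost in bufSize chunks.
    try (Socket s = new Socket("127.0.0.1", server.getLocalPort());
         OutputStream out = s.getOutputStream()) {
      byte[] buf = new byte[bufSize];
      long start = System.nanoTime();
      for (long sent = 0; sent < TOTAL_BYTES; sent += bufSize) {
        out.write(buf);
      }
      out.flush();
      s.shutdownOutput();   // EOF for the reader
      reader.join();
      double secs = (System.nanoTime() - start) / 1e9;
      System.out.printf("%d-byte buffers: %.0f MB/sec%n",
          bufSize, (TOTAL_BYTES / (1024.0 * 1024.0)) / secs);
    }
  }
}
```

The sendfile() variant has no direct analogue in a pure-Java write loop; FileChannel.transferTo() is how the datanode gets that same zero-copy behavior on the sending side.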
The remaining results involve HDFS. In these tests the blks being read are all in the kernel page cache. This was done to completely remove disk seek latencies from the equation and to highlight any datanode overheads. io.file.buffer.size was 64K in all tests. (Todd measured a 30% improvement using the direct method with checksums enabled. I can’t completely reconcile this improvement with the results below, but I wonder if it’s due to that test using the default 4K buffers? The results of that test would be consistent with the results below if that were the case. In any event, it would be good to reconcile the differences at some point.)
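For reference, the setup amounted to something like the sketch below (class name and path handling are hypothetical). One throwaway pass over the file pulls the underlying blk files into the kernel page cache so later runs see no disk seeks:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WarmCache {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("io.file.buffer.size", 64 * 1024);  // 64K, as in all tests here
    FileSystem fs = FileSystem.get(conf);

    // One throwaway pass so the datanode-side blk files end up in the
    // kernel page cache.
    byte[] scratch = new byte[64 * 1024];
    try (FSDataInputStream in = fs.open(new Path(args[0]))) {
      while (in.read(scratch) >= 0) { /* discard */ }
    }
  }
}
```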
The next piece of data we wanted was the maximum rate at which the client can ingest data. The first thing we did was run a simple streaming read. In this case we saw about 280MB/sec. This is nowhere near 1.6GB/sec, so the bottleneck must be in the client and/or the server (i.e., it’s not the pipe). The client process was at 100% CPU, so the bottleneck is probably there. To verify, we disabled checksum verification on the client; this number went up to 776MB/sec and client CPU utilization was still 100%, so the bottleneck appears to still be at the client. This is most likely because the client has to actually copy the data out of the kernel while the datanode uses sendfile.
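The streaming test is essentially a tight read loop like the sketch below (class name and argument handling are ours for illustration; we are assuming FileSystem.setVerifyChecksum() as the way to toggle client-side verification, which matches how the no-checksum case can be run):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamReadBench {
  public static void main(String[] args) throws Exception {
    Path file = new Path(args[0]);
    boolean verifyChecksum = Boolean.parseBoolean(args[1]);

    FileSystem fs = FileSystem.get(new Configuration());
    fs.setVerifyChecksum(verifyChecksum);  // toggle client-side CRC verification

    byte[] buf = new byte[64 * 1024];
    long bytes = 0;
    long start = System.nanoTime();
    try (FSDataInputStream in = fs.open(file)) {
      int n;
      while ((n = in.read(buf)) >= 0) {
        bytes += n;
      }
    }
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("checksum=%b: %.0f MB/sec%n",
        verifyChecksum, bytes / (1024.0 * 1024.0) / secs);
  }
}
```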
CONCLUSION: Maximum client ingest rate for a stream is around 280MB/sec. The datanode is capable of streaming out at least 776MB/sec. Given current client code, there would not be a significant advantage to going direct to the file, because checksum calculation and other client overheads limit the client’s ingest rate to 280MB/sec and the datanode is easily capable of sustaining this rate for streaming reads.
The next thing we wanted to look at was random I/O. There is a lot more overhead on the datanode for this particular use case, so this could be a place where direct access could really excel. The first thing we did here was run a simple random read test to again measure the maximum read throughput. In this case we measured 105MB/sec. Again we tried to eliminate the bottlenecks; however, this is more complicated in the random read case because it is a request/response type of protocol. So, first we focused on the datanode. HDFS-941 is a proposed change which helps the pread use case significantly. The implementation in 941 seems very reasonable and looks to be wrapping up very soon. So, we applied the 941 patch, and this improved the throughput to 143MB/sec.
This isn’t at the 280MB/sec client ceiling yet, so it’s still conceivable that going direct could add a nice boost.
Since this is a request/response protocol, the checksum processing on the client will impact the overall throughput of random I/O use cases. With checksums disabled, the random I/O throughput increased from 143MB/sec to 221MB/sec.
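The random read test is the same idea as the streaming sketch above, but it drives the positioned-read (pread) path instead of the streaming path. A rough sketch, again with hypothetical naming:

```java
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomReadBench {
  public static void main(String[] args) throws Exception {
    Path file = new Path(args[0]);
    final int readSize = 64 * 1024;
    final int reads = 16 * 1024;      // ~1GB of random I/O total

    FileSystem fs = FileSystem.get(new Configuration());
    long fileLen = fs.getFileStatus(file).getLen();
    byte[] buf = new byte[readSize];
    Random rnd = new Random(42);

    long start = System.nanoTime();
    try (FSDataInputStream in = fs.open(file)) {
      for (int i = 0; i < reads; i++) {
        long pos = (long) (rnd.nextDouble() * (fileLen - readSize));
        in.readFully(pos, buf);       // positioned read: exercises the pread path
      }
    }
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("%.0f MB/sec%n",
        (long) reads * readSize / (1024.0 * 1024.0) / secs);
  }
}
```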
CONCLUSION: A localhost socket maxes out at around 1.6GB/sec; we measured 827MB/sec for no-checksum streaming reads. The datanode is currently not capable of maxing out a localhost socket.
CONCLUSION: Clients can currently ingest about 280MB/sec. This rate is easily reached with streaming reads. For random reads with HDFS-941, the client ceiling is a bit faster than what the datanode can deliver (280MB/sec vs 221MB/sec), but not dramatically so. Therefore, for today the right approach seems to be to enhance the datanode to make sure the bottleneck is squarely at the client. Since the client bottleneck is mainly due to checksum calculation and data copies out of the kernel, going direct to a blk file shouldn’t have a significant impact, because both of these overhead activities need to be performed whether going direct or not.
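To get a feel for the checksum component of that client bottleneck in isolation, a raw CRC32 microbenchmark is instructive. The sketch below uses java.util.zip.CRC32 over 64K buffers; note that HDFS actually checksums every 512-byte chunk (io.bytes.per.checksum), so the real client’s per-byte cost will differ somewhat:

```java
import java.util.zip.CRC32;

public class ChecksumBench {
  public static void main(String[] args) {
    byte[] buf = new byte[64 * 1024];
    long total = 4L << 30;                 // checksum 4GB worth of buffers
    CRC32 crc = new CRC32();

    long start = System.nanoTime();
    for (long done = 0; done < total; done += buf.length) {
      crc.reset();
      crc.update(buf, 0, buf.length);      // pure checksum cost, no I/O
    }
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("CRC32: %.0f MB/sec%n",
        total / (1024.0 * 1024.0) / secs);
  }
}
```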
The results above are all in terms of single-reader throughput of cached blk files. More scalability testing needs to be performed. We did verify that on a dual quad-core system the datanode could scale its random read throughput from 137MB/sec to 480MB/sec with 4 readers. This was enough load to saturate 5 of the 8 cores, with clients consuming 3 and datanodes consuming 2. It’s just one data point; there’s lots more work to be done in the area of datanode scalability.
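For anyone who wants to repeat the multi-reader experiment, one way to drive N concurrent readers is sketched below (hypothetical harness; it uses in-process threads for brevity, whereas separate client processes would isolate client CPU from reader CPU more cleanly):

```java
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MultiReaderBench {
  public static void main(String[] args) throws Exception {
    final Path file = new Path(args[0]);
    final int readers = Integer.parseInt(args[1]);   // e.g. 1, 2, 4
    final int readSize = 64 * 1024;
    final int readsPerThread = 4096;

    final FileSystem fs = FileSystem.get(new Configuration());
    final long fileLen = fs.getFileStatus(file).getLen();
    final AtomicLong bytes = new AtomicLong();

    ExecutorService pool = Executors.newFixedThreadPool(readers);
    long start = System.nanoTime();
    for (int t = 0; t < readers; t++) {
      pool.execute(() -> {
        // Each reader gets its own stream and does independent preads.
        try (FSDataInputStream in = fs.open(file)) {
          byte[] buf = new byte[readSize];
          Random rnd = new Random();
          for (int i = 0; i < readsPerThread; i++) {
            long pos = (long) (rnd.nextDouble() * (fileLen - readSize));
            in.readFully(pos, buf);
            bytes.addAndGet(readSize);
          }
        } catch (Exception e) { throw new RuntimeException(e); }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("%d readers: %.0f MB/sec aggregate%n",
        readers, bytes.get() / (1024.0 * 1024.0) / secs);
  }
}
```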
Latency is also a critical attribute of the datanode, and more data needs to be gathered in this area. However, I propose we focus on fixing any contention/latency issues within the datanode prior to diving into a direct I/O sort of approach (there are already a few Jiras out there in the area of improving concurrency within the datanode). If we can’t get anywhere near the latency requirements, then at that point we should consider more efficient ways of getting at the data.
Thanks to Kihwal Lee and Dave Thompson for doing a significant amount of data gathering! Gathering this type of data always seems to take longer than one would think, so thank you for the efforts.