Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Hadoop Flags: Reviewed
Description
One of the major strategies Hadoop uses to achieve scalable data processing is to move the code to the data. However, placing the DFS client on the same physical node as the data blocks it reads doesn't improve read performance as much as expected.
After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem is that the HDFS streaming protocol causes many more read I/O operations (iops) than necessary. Consider the case of a DFSClient fetching a 64-MB disk block from the DataNode process (running in a separate JVM) on the same machine. The DataNode satisfies the single disk-block request by sending data back to the HDFS client in 64-KB chunks. In BlockSender.java, this is done in the sendChunk() method, which relies on Java's transferTo() method. Depending on the host O/S and JVM implementation, transferTo() is implemented as either a sendfilev() syscall or a pair of mmap() and write() calls. In either case, each chunk is read from the disk with a separate I/O operation, so the single request for a 64-MB block ends up hitting the disk as over a thousand smaller requests of 64 KB each.
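The chunking described above can be sketched with a minimal stand-alone program. This is not the actual BlockSender code, just an illustration of the same pattern: a file copied through FileChannel.transferTo() in 64-KB chunks, counting how many transferTo() calls (and hence separate underlying I/O operations) a single block-sized request turns into.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkedTransferDemo {
    static final int CHUNK = 64 * 1024; // BlockSender's per-chunk send size

    /** Copy src to dst via transferTo() in 64-KB chunks, returning how many
     *  transferTo() calls (i.e. separate I/O operations) were issued. */
    static int copyInChunks(Path src, Path dst) throws IOException {
        int calls = 0;
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                                                StandardOpenOption.WRITE)) {
            long size = in.size(), pos = 0;
            while (pos < size) {
                // Each call maps to a sendfilev() or mmap()+write() underneath,
                // so every 64-KB chunk costs at least one I/O operation.
                pos += in.transferTo(pos, Math.min(CHUNK, size - pos), out);
                calls++;
            }
        }
        return calls;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("block", ".dat");
        Files.write(src, new byte[1 << 20]);     // 1-MB stand-in for a block
        Path dst = Files.createTempFile("copy", ".dat");
        System.out.println("transferTo calls for 1 MB: " + copyInChunks(src, dst));
        System.out.println("calls for a full 64-MB block: " + ((64L << 20) / CHUNK));
    }
}
```

For a full 64-MB block this loop issues 1024 transferTo() calls, which matches the "over a thousand smaller requests" observed in the traces.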
Since the DFSClient runs in a different JVM and process than the DataNode, shuttling data from the disk to the DFSClient also incurs context switches each time network packets are sent (in this case, each 64-KB chunk turns into a large number of 1500-byte packet send operations). Thus we see a large number of context switches for each block send operation.
I'd like to get some feedback on the best way to address this, but I think the answer is to provide a mechanism for a DFSClient to directly open data blocks that happen to be on the same machine. It could do this by examining the set of LocatedBlocks returned by the NameNode and marking those that should be resident on the local host. Since the DataNode and DFSClient (probably) share the same Hadoop configuration, the DFSClient should be able to find the files holding the block data and open them directly. This would avoid the context switches imposed by the network layer, and would allow for much larger read buffers than 64 KB, which should reduce the number of iops issued per block read.
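The proposal above could be sketched roughly as follows. Everything here is hypothetical: the Replica record stands in for the real LocatedBlock/DatanodeInfo types, the "blk_&lt;id&gt;" file naming mirrors the DataNode storage layout, and the data directory is passed in rather than looked up from the real Hadoop configuration.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: decide whether a block replica is local and, if so,
 *  open its backing file directly instead of streaming through the DataNode. */
public class LocalBlockReaderSketch {
    // Stand-in for the real LocatedBlock/DatanodeInfo types (hypothetical).
    record Replica(String host, long blockId) {}

    /** A replica is "local" if its host matches this machine's hostname. */
    static boolean isLocal(Replica r) throws IOException {
        return r.host().equals(InetAddress.getLocalHost().getHostName());
    }

    /** Resolve the on-disk path for a block id under a shared data directory.
     *  The "blk_<id>" naming follows the DataNode's storage layout; taking
     *  dataDir as a parameter stands in for reading the shared config. */
    static Path blockFile(Path dataDir, long blockId) {
        return dataDir.resolve("blk_" + blockId);
    }

    static InputStream openBlock(Path dataDir, Replica r) throws IOException {
        if (isLocal(r)) {
            // Bypass the DataNode entirely: read the block file with a large
            // buffer, avoiding the 64-KB chunking and the per-packet context
            // switches of the streaming protocol.
            return new BufferedInputStream(
                Files.newInputStream(blockFile(dataDir, r.blockId())), 4 << 20);
        }
        throw new IOException("replica not local; fall back to streaming read");
    }

    public static void main(String[] args) throws IOException {
        Path dataDir = Files.createTempDirectory("dfs-data");
        Files.write(blockFile(dataDir, 42L), "block-bytes".getBytes());
        Replica r = new Replica(InetAddress.getLocalHost().getHostName(), 42L);
        try (InputStream in = openBlock(dataDir, r)) {
            System.out.println(new String(in.readAllBytes()));
        }
    }
}
```

The non-local branch matters: any real implementation would need to fall back to the existing DataNode streaming path whenever the replica is remote or the file open fails.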
Attachments
Issue Links
- blocks
  - HBASE-8337 Investigate why disabling hadoop short circuit read is required to make recovery tests pass consistently under hadoop2 (Closed)
- contains
  - HADOOP-9983 SocketInputStream class (org.apache.hadoop.net.SocketInputStream) is not public class in 2.0.5-alpha (Resolved)
- depends upon
  - HADOOP-6311 Add support for unix domain sockets to JNI libs (Resolved)
- is depended upon by
  - ACCUMULO-884 Insight into short circuit read for local files (Resolved)
- is related to
  - HDFS-1599 Umbrella Jira for Improving HBASE support in HDFS (Open)
  - HDFS-6699 Secure Windows DFS read when client co-located on nodes with data (short-circuit reads) (Open)
  - HBASE-3529 Add search to HBase (Closed)
  - HDFS-2246 Shortcut a local client reads to a Datanodes files directly (Closed)
- relates to
  - HDFS-4284 BlockReaderLocal not notified of failed disks (Open)
  - HADOOP-3205 Read multiple chunks directly from FSInputChecker subclass into user buffers (Closed)
  - HDFS-385 Design a pluggable interface to place replicas of blocks in HDFS (Closed)