Attaching v1 of a design document for this feature. This does not include a test plan - that will follow once implementation has gone a bit further. Pasting the design doc below as well:
Design Document: Local Read Optimization
Currently, when the DFS Client is located on the same physical node as the DataNode serving the data, it does not use this knowledge to its advantage. All blocks are read through the same protocol based on a TCP connection. Early experimentation has shown that this has a 20-30% overhead when compared with reading the block files directly off the local disk.
This JIRA seeks to improve the performance of node-local reads by providing a fast path that is enabled in this case. This case is very common, especially in the context of MapReduce jobs where tasks are scheduled local to their data.
Although writes are likely to see an improvement here too, this JIRA will focus only on the read path. The write path is significantly more complicated due to write pipeline recovery, append support, etc. Additionally, the write path will still have to go over TCP to the non-local replicas, so the throughput improvements will probably not be as marked.
- As mentioned above, the majority of data read during a MapReduce job tends to be from local datanodes. This optimization should improve MapReduce performance of read-constrained jobs significantly.
- Random reads should see a significant performance benefit with this patch as well. Applications such as the HBase Region Server should see a very large improvement.
Users will not have to make any specific changes to use the performance improvement - the optimization should be transparent and retain all existing semantics.
Interaction with Current System
This behavior needs modifications in two areas:
The datanode needs to be extended to provide access to the local block storage to the reading client.
DFSInputStream needs to be extended in order to enable the fast read path when reading from local datanodes.
Unix Domain Sockets via JNI
In order to maintain security, we cannot simply have the reader access blocks through the local filesystem. The reader may be running as an arbitrary user ID, and we should not require world-readable permissions on the block storage.
Unix domain sockets offer the ability to transport already-open file descriptors from one peer to another using the "ancillary data" construct and the sendmsg(2) system call. This ability is documented in unix(7) under the SCM_RIGHTS section.
Unix domain sockets are unfortunately not available in Java. We will need to employ JNI to access the appropriate system calls.
Modify DFSClient/DataNode interaction
The DFS Client will need to be able to initiate the fast path read when it detects it is connecting to a local DataNode. The DataNode needs to respond to this request by providing the appropriate file descriptors or by reverting to the normal slow path if the functionality has been administratively disabled, etc.
Unix Domain Sockets in Java
The Android open source project currently includes support for Unix Domain Sockets in the android.net package. It also includes the native JNI code to implement these classes. Android is Apache 2.0 licensed and thus we can freely use the code in Hadoop.
The Android project relies on a lot of custom build infrastructure and utility functions. In order to reduce our dependencies, we will copy the appropriate classes into a new org.apache.hadoop.net.unix package. We will include the appropriate JNI code in the existing libhadoop library. If HADOOP-4998 (native runtime library for Hadoop) progresses in the near term, we could include this functionality there.
The JNI code needs small modifications to work properly in the Hadoop build system without pulling in a large number of Android dependencies.
Fast path initiation
When DFSInputStream is connecting to a node, it can determine whether that node is local by simply inspecting the IP address. In the event that it is a local host and the fast path has not been prohibited by the Configuration, the fast path will be initiated. The fast path is simply a different BlockReader implementation.
Fast path interface
BlockReader will become an interface, with the current implementation being renamed to RemoteBlockReader. The fast-path for local reads will be a LocalBlockReader, which is instantiated after it has been determined that the target datanode is local.
Fast path mechanism
Currently, when the DFSInputStream connects to the DataNode, it sends OP_READ_BLOCK, including the access token, block id, etc. Instead, when the fast path is desired, the client will take the following steps:
- Opens a unix socket listening in the in-memory socket namespace. The socket's name will be identical to the clientName already available in the input stream, plus a unique ID for this specific input stream (so that parallel local readers function without collision).
- Sends a new opcode OP_CONNECT_UNIX. This operation takes the same parameters as OP_READ_BLOCK, but indicates to the datanode that the client is looking for a local connection.
- The datanode performs the same access token and block validity checks as it currently does for the OP_READ_BLOCk case. Thus the security model of the current implementation is retained.
- If the datanode refuses for any reason, it responds over the block transceiver protocol with the same error mechanism as the current approach. If the checks pass:
- DN connects to the client via the unix socket.
- DN opens the block data file and block metadata file
- DN extracts the FileDescriptor objects from these InputStreams, and sends them as ancillary data on the unix domain socket. It then closes its side of the unix domain socket.
- DN sends an "OK" response via the TCP socket.
- If any error happens during this process, it sends back an error response.
- On the client side, if an error response is received from the OP_CONNECT_UNIX request, the client will mark a flag indicating that it should no longer try the fast path, and then fall back to the existing BlockReader.
- If an OK response is received, the client constructs a LocalBlockReader (LBR).
- The LBR reads from the unix domain socket to receive the block data and metadata file descriptors.
- At this point, both the TCP socket and the unix socket can be closed; the file descriptors remain valid once they have been received despite any closed sockets.
- The LBR then provides the BlockReader interface by simply calling seek(), read(), etc, on an input stream constructed from these file descriptors.
- Some refactoring may occur here to try to share checksum verification code between the LocalBlockReader and RemoteBlockReader.
The reason for the connect-back protocol rather than having the datanode simply listen on a unix socket is to simplify the integration path. In order to listen on a socket, the datanode would need an additional thread to spawn off transceivers. Additionally, it allows for a way to verify that the client is in fact reading from the datanode on the target host/port without relying on some conventional socket path.
DFS Read semantics clarification
Before embarking on the above, the DFS Read semantics should be clarified. The error handling and retry semantics in the current code are quite unclear. For example, there is significant discussion in
HDFS-127 that indicates a lot of confusion about proper behavior.
Although the work is orthogonal to this patch, it will be quite beneficial to nail down the semantics of the existing implementation before attempting to add onto it. I propose this work be done in a separate JIRA concurrently with discussion on this one, with the two pieces of work to be committed together if possible. This will keep the discussion here on-point and avoid digression into discussion of existing problems like
As described above, if any failure or exception occurs during the establishment of the fast path, the system will simply fall back to the existing slow path.
One issue that is currently unclear is how to handle IOExceptions on the underlying blocks when the read is being performed by the client. See Work Remaining below.
Since the block open() call is still being performed by the datanode, there is no loss of security with this patch. AccessToken checking is performed by the datanode in the same manner as currently exists. Since the blocks can be opened read-only, the recipient of the file descriptors cannot perform unwanted modification.
Passing file descriptors over Unix Domain Sockets is supported on Linux, BSD, and Solaris. There may be some differences in the different implementations. The first version of this JIRA should target Linux only, and automatically disable itself on platforms where it will not function correctly. Since this is an optimization and not a new feature (the slow path will continue to be supported) I believe this is OK.
Work already completed
The early work in
HDFS-347 indicated that the performance improvements of this patch will be substantial. The experimentation modified the BlockReader to "cheat" and simply open the stored blocks with standard file APIs, which had been chmodded world readable. This improved read of a 1GB from 8.7 seconds to 5.3 seconds, and improved random IO performance by a factor of more than 30.
Local Sockets and JNI Library
I have already ported the local sockets JNI code from the Android project into a local branch of the Hadoop code base, and written simple unit tests to verify its operation. The JNI code compiles as part of libhadoop, and the Java side uses the existing NativeCodeLoader class. These patches will become part of the Common project.
To aid in testing and understanding of the code, I have refactored DFSInputStream to be a standalone class instead of an inner class of DFSClient. Additionally, I have converted BlockReader to an interface and renamed BlockReader to RemoteBlockReader. In the process I also refactored the contents of DFSInputStream to clarify the failure and retry semantics. This work should be migrated to another JIRA as mentioned above.
Fast path initiation and basic operation
I have implemented the algorithm as described above and added new unit tests to verify operation. Basic unit tests are currently passing using the fast path reads.
Work Remaining / Open Questions
The current implementation of LocalBlockReader does not verify checksums. Thus, some unit tests are not passing. Some refactoring will probably need to be done to share the checksum verification code between LocalBlockReader and RemoteBlockReader.
Given that the reads are now occuring directly from the client, we should investigate whether we need to add any mechanism for the client to report errors back to the DFS. The client can still report checksum errors in the existing mechanism, but we may need to add some method by which it can report IO Errors (e.g. due to a failing volume). I do not know the current state of volume error tracking in the datanode; some guidance here would be appreciated.
Interaction with other features (e.g. Append)
We should investigate whether (and how) this feature will interact with other ongoing work, in particular appends. If there is any complication, it should be straightforward to simply disable the fast path for any blocks currently under construction. Given that the primary benefit for the fast path is in mapreduce jobs, and mapreduce jobs rarely run on under-construction blocks, this seems reasonable and avoids a lot of complexity.
Currently, the JNI library has some TODO markings for implementation of timeouts on various socket operations. These will need to be implemented for proper operation.
Given that this is a performance patch, benchmarks of the final implementation should be done, covering both random and sequential IO.
Currently, the datanode maintains metrics about the number of bytes read and written. We no longer will have accurate information unless we make reports back from the client. Alternatively, the datanode can use the "length" parameter of OP_READ_UNIX and assume that the client will always read the entirety of data it has requested. This is not a fair assumption, but the approximation may be fine.
Once the DN has sent a file descriptor for a block to the client, it is impossible to audit the byte offsets that are read. It is possible for a client to request read access to a small byte range of a block, receive a socket, and then proceed to read the entire block. We should investigate whether there is a requirement for byte-range granularity on audit logs and come up with possible solutions (eg disabling fast path for non-whole-block reads).
File Descriptors held by Zombie Processes
In practice on some clusters, DFSClient processes can stick around as zombie processes. In the TCP-based DFSClient, these zombie connections are eventually timed out by the DN server. In this proposed JIRA, the file descriptors would be already transferred, and thus would be stuck open on the zombie. This will not block file deletion, but does block the reclaiming of the blocks on the underlying file system. This may cause problems on HDFS instances with a lot of block churn and a bad zombie problem. Dhruba can possibly elaborate here.
Determining local IPs
In order to determine when to attempt the fast path, the DFSClient needs to know when it is connecting to a local datanode. This will rarely be a loopback IP address, so we need some way of determining which IPs are actually local. This will probably necessitate an additional method or two in NetUtils in order to inspect the local interface list, with some caching behavior.