Esteban Gutierrez, Chris Nauroth: I took another look at the patch and I think that it's correct as-is.
The confusing part about all of this is that DFSInputStream uses the Peer classes to encapsulate sockets, whereas DFSOutputStream uses "raw" socket objects directly. The word "Peer" does not occur in DFSOutputStream.java or DataStreamer.java; it is used only in the input stream code. Therefore, since this patch only changes newConnectedPeer, it affects only DFSInputStream.
In contrast, the issue Chris brought up is actually only a DFSOutputStream issue. Chris commented:
Right now, write timeout is enforced not as a socket option, but instead enforced per operation by passing the timeout to SocketOutputStream, which uses it in the underlying NIO selector calls. The exact write timeout value is not purely based on configuration. It's also a function of the number of nodes in the write pipeline. The details are implemented in DFSClient#getDatanodeWriteTimeout.
But DFSClient#getDatanodeWriteTimeout is less generic than its name suggests. It is used only by the DFSOutputStream (write path) code, to scale the timeout by the number of nodes in the pipeline. It is never called by the input stream code at all. So I believe that concern does not apply to this patch.
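For reference, the scaling logic is easy to sketch in isolation. This is a standalone sketch, not the real DFSClient code; the constants mirror the defaults I recall from HdfsServerConstants (an 8-minute base plus 5 seconds per pipeline node), so treat the exact numbers as assumptions:

```java
// Standalone sketch of the scaling logic in DFSClient#getDatanodeWriteTimeout.
// Constants are assumed defaults, not pulled from a real Hadoop build.
public class WriteTimeoutSketch {
    static final int WRITE_TIMEOUT = 8 * 60 * 1000;      // assumed base, ms
    static final int WRITE_TIMEOUT_EXTENSION = 5 * 1000; // assumed per-node, ms

    static int getDatanodeWriteTimeout(int numNodes) {
        // A non-positive base timeout is treated as "no timeout at all".
        return (WRITE_TIMEOUT > 0)
            ? WRITE_TIMEOUT + WRITE_TIMEOUT_EXTENSION * numNodes
            : 0;
    }

    public static void main(String[] args) {
        // A 3-node pipeline: 480000 + 5000 * 3 = 495000 ms
        System.out.println(getDatanodeWriteTimeout(3));
    }
}
```

The point is simply that the value grows with pipeline length, which only makes sense on the write path where a pipeline exists.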
I see the original problem reported in HDFS-7005 was related to lack of read timeout. I'm wondering if there is actually no further change required for write timeout, given the above explanation. Is anyone seeing an actual problem related to lack of write timeout?
Good question. Here is the scenario where we can run into problems in the DFSInputStream code:
DFSInputStream#getBlockReader uses BlockReaderFactory#build to create a new RemoteBlockReader2. Ultimately BlockReaderFactory#nextTcpPeer calls into DFSClient#newConnectedPeer. As you can see from this patch, this function sets a read timeout, but no write timeout.
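To make the shape of the fix concrete, here is a standalone sketch using a stub in place of the real Peer. The setReadTimeout/setWriteTimeout method names come from the org.apache.hadoop.hdfs.net.Peer interface; the stub class and everything else are hypothetical scaffolding, not the actual patch:

```java
// Stub standing in for org.apache.hadoop.hdfs.net.Peer; only the two
// timeout setters from the real interface are modeled here.
class StubPeer {
    int readTimeout = -1;   // -1 means "never set"
    int writeTimeout = -1;
    void setReadTimeout(int millis) { readTimeout = millis; }
    void setWriteTimeout(int millis) { writeTimeout = millis; }
}

public class NewConnectedPeerSketch {
    // Sketch of the shape of DFSClient#newConnectedPeer after the patch:
    // before the patch only the read timeout was set.
    static StubPeer newConnectedPeer(int socketTimeout) {
        StubPeer peer = new StubPeer();
        peer.setReadTimeout(socketTimeout);   // existing behavior
        peer.setWriteTimeout(socketTimeout);  // what this patch adds
        return peer;
    }

    public static void main(String[] args) {
        StubPeer p = newConnectedPeer(60000);
        System.out.println(p.readTimeout + " " + p.writeTimeout);
    }
}
```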
One additional note: the Peer implementation we will be creating here is NioInetPeer. NioInetPeer.java contains an org.apache.hadoop.net.SocketOutputStream, which is the same class DFSOutputStream uses to "simulate" socket write timeouts with non-blocking I/O and selectors. The difference is that DFSOutputStream uses org.apache.hadoop.net.SocketOutputStream directly, while DFSInputStream uses it under the covers via the Peer class. So the timeout mechanism is actually the same. (Chris, I think this answers another question you raised.)
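For anyone unfamiliar with the trick, here is a minimal, self-contained sketch of the same idea: bounding a write on a non-blocking channel with a Selector. This is illustrative plain-JDK code, not the actual SocketOutputStream implementation; a Pipe stands in for a socket whose remote end has stopped reading:

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class TimedWriteSketch {
    // Attempt a write on a non-blocking channel. If nothing can be written
    // immediately, wait on a Selector for writability up to timeoutMs,
    // then fail with SocketTimeoutException instead of blocking forever.
    static int timedWrite(Pipe.SinkChannel ch, ByteBuffer buf, long timeoutMs)
            throws IOException {
        int n = ch.write(buf);
        if (n != 0 || !buf.hasRemaining()) {
            return n;
        }
        try (Selector sel = Selector.open()) {
            ch.register(sel, SelectionKey.OP_WRITE);
            if (sel.select(timeoutMs) == 0) {
                throw new SocketTimeoutException("write timed out");
            }
        }
        return ch.write(buf);
    }

    public static void main(String[] args) throws IOException {
        Pipe pipe = Pipe.open();
        pipe.sink().configureBlocking(false);
        ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
        // Nobody reads from pipe.source(), so the pipe fills up and the
        // timed write fails fast instead of hanging.
        try {
            while (true) {
                buf.clear();
                timedWrite(pipe.sink(), buf, 200);
            }
        } catch (SocketTimeoutException e) {
            System.out.println("timed out as expected");
        }
    }
}
```

This is why a blocking java.net.Socket write has no timeout knob, and why both Hadoop streams end up on the same selector-based mechanism.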
I have always found it confusing that org.apache.hadoop.net.SocketOutputStream shares its name with java.net.SocketOutputStream. A note for anyone new to Hadoop reading this: they are very different classes!
So RemoteBlockReader2#newBlockReader has this code:
    final DataOutputStream out = new DataOutputStream(new BufferedOutputStream(
        peer.getOutputStream()));
    // This writes the OP_READ_BLOCK request to the datanode. With no write
    // timeout set on the peer, this write can block indefinitely.
    new Sender(out).readBlock(block, blockToken, clientName, startOffset, len,
        verifyChecksum, cachingStrategy);

    DataInputStream in = new DataInputStream(peer.getInputStream());

    BlockOpResponseProto status = BlockOpResponseProto.parseFrom(
        PBHelper.vintPrefixed(in));
    checkSuccess(status, peer, block, file);
    ReadOpChecksumInfoProto checksumInfo =
        status.getReadOpChecksumInfo();
You can see the potential for things to get stuck here: the Sender#readBlock call writes the read request to the datanode, and since we have not set a write timeout, that write can block indefinitely. That's the case this patch fixes. We're seeing this in production now, so it's definitely more than a theoretical problem.
I am +1 on the patch. Chris Nauroth, can you take another look and verify that this answers your questions? We will hold off on committing until that point.