Thanks for the comments, everyone. Let's discuss the SASL point first, because it could shift the design and make the specific questions about the proposed protocol change irrelevant.
Did you consider at all scrapping our custom authentication protocol and instead switching to using straight SASL MD5-DIGEST for the DataTransferProtocol?
Thanks for pointing out HDFS-3637. After further review of that patch, I see how we can iterate on it. I think it also has some benefits over the proposal that I posted: 1) consistency with authentication in the rest of the codebase, and 2) enabling encryption would defeat a man-in-the-middle attack without harming intermediate proxy deployments the way source address validation might. I'd like to explore the SASL solution further.
The only potential downside I see is that if we ever pipeline multiple operations over a single connection, then we'd need to renegotiate SASL per operation, because the authorization decision may be different per block. This doesn't seem like an insurmountable problem though.
I have a question about the compatibility impact of HDFS-3637. I see that an upgraded client can talk to an old cluster, and an old client can talk to an upgraded cluster if encryption is off. It looks like if it's an upgraded cluster and encryption is on, then DataXceiver will not run operations sent from unencrypted client connections, including connections initiated from an old client. This implies that all clients must be upgraded before it's safe to turn on encryption in the cluster. Do I understand correctly? If so, can we relax this logic a bit to allow for compatibility of an old client connected to an upgraded cluster with SASL on? The design doc proposed checking whether or not the datanode port is < 1024, and if so, then allow the old connection. The thinking here is that anyone continuing to run on a port < 1024 must still have a component that hasn't upgraded, so therefore it needs to support the old connection. Once datanode has been reconfigured to run on a port >= 1024, then all non-encrypted connections can be rejected.
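To make the proposed relaxation concrete, here is a minimal sketch of the port-based check in Java. The class and method names (`LegacyConnectionPolicy`, `allowsUnencryptedConnection`) are hypothetical and do not come from the HDFS-3637 patch or from DataXceiver. The sketch only encodes the rule described above, on the assumption that the datanode knows its own listening port and whether encryption is enabled.

```java
/**
 * Hypothetical illustration of the compatibility rule from the design doc.
 * Not part of HDFS; names and structure are assumptions for discussion.
 */
public final class LegacyConnectionPolicy {
    // Ports below this value are privileged. A datanode still bound to one
    // is assumed to be serving at least one component that has not upgraded.
    private static final int FIRST_UNPRIVILEGED_PORT = 1024;

    private LegacyConnectionPolicy() {
        // Static helper only; no instances.
    }

    /**
     * Decides whether an unencrypted (old-style) connection may be served.
     *
     * @param datanodePort      the port the datanode is listening on
     * @param encryptionEnabled whether encryption is turned on for the cluster
     * @return true if the unencrypted connection should be allowed
     */
    public static boolean allowsUnencryptedConnection(int datanodePort,
                                                      boolean encryptionEnabled) {
        if (datanodePort < 0 || datanodePort > 65535) {
            throw new IllegalArgumentException("Invalid port: " + datanodePort);
        }
        if (!encryptionEnabled) {
            // Encryption is off, so unencrypted connections are the norm.
            return true;
        }
        // Encryption is on: tolerate old clients only while the datanode
        // still runs on a privileged port.
        return datanodePort < FIRST_UNPRIVILEGED_PORT;
    }
}
```

For example, with encryption on, `allowsUnencryptedConnection(1004, true)` returns `true` and `allowsUnencryptedConnection(50010, true)` returns `false`. The port numbers are illustrative values only. The point of keying on the port is that moving the datanode to a port >= 1024 becomes the explicit operator signal that no old clients remain.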
Also, I wasn't sure about how the HDFS-3637 patch impacts compatibility for inter-datanode connections. Is it possible to have a mix of old and upgraded datanodes running, some with encryption on and some with encryption off, or does it require a coordinated push to turn on encryption across the whole cluster?
We wanted to be conscious of backwards compatibility with this change, particularly for a rolling upgrade scenario.