Thanks for the great reviews, everyone. I'll respond to some of the feedback now and update the patch later.
What you could do after the stat update is use tryLock for a short time. If you can't get the lock, oh well, this heartbeat response doesn't get any commands.
That's an interesting idea. I'll explore this.
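To make sure I understand the suggestion, here is a minimal sketch of what I think you mean, using a bounded tryLock. The class, method, and lock names here are hypothetical, not from the current patch:

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the tryLock idea; all names are illustrative.
public class HeartbeatResponder {
  private final ReentrantLock commandLock = new ReentrantLock();

  public List<String> getCommandsForHeartbeat() throws InterruptedException {
    // Wait only briefly for the lock. If it is contended, skip the
    // commands for this heartbeat instead of blocking the response.
    if (commandLock.tryLock(10, TimeUnit.MILLISECONDS)) {
      try {
        return buildCommands();
      } finally {
        commandLock.unlock();
      }
    }
    // Could not get the lock in time: this heartbeat carries no commands.
    return Collections.emptyList();
  }

  private List<String> buildCommands() {
    return Collections.emptyList();  // placeholder for real command assembly
  }
}
{code}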
I'm not sure we need yet another RPC server for this purpose.
Just to make sure everyone is aware: the additional optional RPC server is effectively already there, because we committed HDFS-9311, which specifically targeted the related problem of ZKFC health check messages getting blocked. I agree that yet another RPC server is not ideal in terms of operational complexity, but I also saw it as the only option viably achievable short-term in the 2.x line.
In BPOfferServiceActor#run we retry the await operation on being interrupted. My question is: when would it be safe to retry?
In practice, I expect it will never retry. Suppose the thread enters await on the latch but gets interrupted before the initial registration completes. The only thing that triggers interruption of this thread is shutdown of the whole DataNode (either the whole JVM process or a single DataNode inside a MiniDFSCluster). That means that by the time this thread gets interrupted, internal flags have already been updated such that shouldRun() will return false on the next iteration. run() would then return with no further action taken, and the thread would stop.
However, there is also no harm done if the await gets retried. This is a daemon thread, so even if it keeps retrying, it won't block a normal JVM exit.
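For reference, the shape of that code is roughly the following. This is a simplified, self-contained sketch, not a verbatim excerpt from the patch:

{code:java}
import java.util.concurrent.CountDownLatch;

// Simplified sketch of the retry-on-interrupt pattern discussed above;
// names are illustrative, not copied from the patch.
class LifelineSender implements Runnable {
  private final CountDownLatch initialRegistrationComplete;
  private volatile boolean shuttingDown = false;

  LifelineSender(CountDownLatch initialRegistrationComplete) {
    this.initialRegistrationComplete = initialRegistrationComplete;
  }

  void stop() {
    shuttingDown = true;  // set during DataNode shutdown, before the interrupt
  }

  private boolean shouldRun() {
    return !shuttingDown;
  }

  @Override
  public void run() {
    while (shouldRun()) {
      try {
        // Block until initial registration completes before sending
        // any lifeline messages.
        initialRegistrationComplete.await();
        break;
      } catch (InterruptedException ie) {
        // Interruption is triggered only by shutdown, which calls stop()
        // first, so shouldRun() returns false on the next iteration and
        // the loop exits. Retrying the await is therefore harmless.
      }
    }
    // ... main lifeline send loop would follow here ...
  }
}
{code}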
Just wondering if it makes sense to move synchronized(datanodeMap) into getDataNode.
This might be a good idea, but I'd prefer to treat it as a separate code improvement, decoupled from the work here. Right now, multiple points in the code depend on a specific lock ordering (heartbeatManager first, then datanodeMap) to prevent deadlock. The current code makes this ordering explicit with nested synchronized (X) blocks rather than leaving it implicit by declaring particular methods synchronized. Also, HDFS-8966 is making a lot of changes to the locking here, so I expect that making changes now would just create merge conflicts for that effort later.
Did you intend to call heartbeatManager.updateLifeline inside synchronized (datanodeMap), or just inside synchronized (heartbeatManager)? Do we need to keep the lock on datanodeMap while updating stats?
This locking was intentional. If we did not hold the lock on datanodeMap across the get+update, then multiple in-flight heartbeats for the same DataNode could cause a lost update, or even an inconsistency where the final state of the DatanodeDescriptor mixes some stats from one heartbeat with other stats from another. I have not observed excessive lock contention here during the incidents that prompted me to file this JIRA, so I didn't try hard to optimize this locking away. Some of the work in HDFS-8966 is likely to reduce the locking here anyway.
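Concretely, the structure looks roughly like this (a simplified sketch, not a literal excerpt from the patch; the parameter list on updateLifeline is abbreviated here for illustration):

{code:java}
// Simplified sketch, not a literal excerpt from the patch.
void handleLifeline(DatanodeRegistration nodeReg, StorageReport[] reports) {
  synchronized (heartbeatManager) {   // outer lock: always taken first
    synchronized (datanodeMap) {      // inner lock: always taken second
      // Holding datanodeMap across both the lookup and the update means
      // concurrent lifelines for the same DataNode cannot interleave:
      // no lost updates, and no DatanodeDescriptor left holding a mix
      // of stats from two different messages.
      DatanodeDescriptor nodeinfo = getDatanode(nodeReg);
      heartbeatManager.updateLifeline(nodeinfo, reports);
    }
  }
}
{code}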
NN should still enforce a max number of skips and guarantee commands are sent in bounded time. Replication or block recovery is done through an asynchronous protocol, but oftentimes clients expect them to be done "soon".
Are you saying that beyond some skipping threshold, the heartbeat should still be considered a failure, eventually causing the DataNode to be marked stale and then dead? I'm not sure how we'd set such a threshold, given that there is no SLA defined on these operations AFAIK. I'd also say there is still value in keeping a DataNode alive in these cases, such as continuing to serve read traffic.
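For context on the stale-then-dead timeline: the dead-node timeout in 2.x is already derived from two existing settings, so liveness detection stays bounded regardless of any command skipping. A quick illustration of the computation, matching the formula in DatanodeManager, with default values shown:

{code:java}
// Dead-node timeout as computed in DatanodeManager (2.x defaults shown).
long heartbeatIntervalSecs = 3;                   // dfs.heartbeat.interval, seconds
long heartbeatRecheckIntervalMs = 5 * 60 * 1000;  // dfs.namenode.heartbeat.recheck-interval
long heartbeatExpireIntervalMs =
    2 * heartbeatRecheckIntervalMs + 10 * 1000 * heartbeatIntervalSecs;
// heartbeatExpireIntervalMs = 630,000 ms, so a DataNode is declared dead
// about 10.5 minutes after its last successful heartbeat.
{code}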
It seems the introduction of a new RPC server is to work around the existing functionality of RPC, which only supports QoS based on user names.
Yes, that's partially correct.
I agree that the introduction of another RPC server is in some sense a workaround. In fact, I would make the same argument about dfs.namenode.servicerpc-address in the first place: lack of sophisticated QoS is what drove us to isolate operations onto a separate RPC server. (dfs.namenode.servicerpc-address can also have side benefits in multi-homed environments that want to dedicate a whole separate network interface to certain operations.)
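To illustrate the isolation pattern, the relevant hdfs-site.xml settings would look something like this. dfs.namenode.servicerpc-address is the existing key; dfs.namenode.lifeline.rpc-address is the optional key proposed in the current patch, and the host and ports here are just examples:

{code:xml}
<!-- Existing: dedicated RPC server for DataNode and ZKFC traffic. -->
<property>
  <name>dfs.namenode.servicerpc-address</name>
  <value>namenode.example.com:8021</value>
</property>

<!-- Proposed: optional dedicated RPC server for lifeline messages. -->
<property>
  <name>dfs.namenode.lifeline.rpc-address</name>
  <value>namenode.example.com:8022</value>
</property>
{code}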
I think a separate RPC server is a viable short-to-mid-term solution in the 2.x line. Longer term, I'd prefer that we evolve towards more sophisticated QoS that prioritizes critical "control plane" activity like heartbeats. However, this goes deeper than just QoS at the RPC layer, because we have observed contention on the namesystem lock as a contributing factor. This implies that a full solution, running on just one RPC server, requires some kind of call pre-emption capability, or maybe a cooperative multi-tasking approach with operations having the ability to yield the lock. This of course would violate a bunch of assumptions in the NameNode code, which is why I think it's infeasible in 2.x.