Within the Datanode, we often see contention around the FsDatasetImpl lock. This can be for various reasons, eg the DN is under IO load, and other bugs, eg HDFS-15131.
When DN slow downs happen, it is very difficult to debug what is causing it, as there are few messages in the logs which indicate what is happening.
In my experience, the presence of this log is informative:
This comes from the datanode heartbeat thread, and it is logged when the commands returned from the heartbeat cannot be enqueued, and usually this is because the thread cannot get the FsDatasetImpl lock due to contention with other slow threads.
HDFS-14997 moved the command processing to an async thread, and hence this useful message will disappear.
InstrumentedLock introduced a feature to the datanode, where it will log a stack trace if any thread holds the lock for over 300ms. However this will not catch a scenario where 10 threads each hold a lock for 200ms, leading to the next in the queue having waited over 2 seconds.
I believe it would be useful to extend InstrumentedLock to log if a thread has to wait for over some threshold. That way, we could be able to catch scenarios like the heartbeat thread shows us, but in a much more obvious way, provided lock contention is involved.