In our HDFS cluster we observed that the append operation can hold the write lock up to 10x longer than other write operations. By collecting a flame graph on the NameNode (see attachment: append-flamegraph.png), we found that most of the time in the append call is spent in getNumLiveDataNodes().
This method synchronizes on the DatanodeManager, which is particularly expensive in large clusters because datanodeMap is modified in many places, such as when processing DataNode heartbeats.
For the append operation, getNumLiveDataNodes() is invoked in isSufficientlyReplicated.
The way the required replication is calculated there is suboptimal: it calls getNumLiveDataNodes() every time, even though minReplication is usually much smaller than the number of live DataNodes, so the expensive call rarely affects the result.
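
One possible way to avoid the call in the common case is to check minReplication first and only fall back to the live DataNode count when the block has fewer replicas than that. The following is a minimal, self-contained sketch of the idea, not the actual HDFS code: the class ReplicationCheckSketch, its method names, and the IntSupplier standing in for getNumLiveDataNodes() are all hypothetical, and the "eager" variant assumes the current check compares the live replica count against the smaller of minReplication and the number of live DataNodes.

```java
import java.util.function.IntSupplier;

/** Hypothetical stand-in for the replication check; not the real HDFS class. */
final class ReplicationCheckSketch {
    private final int minReplication;
    // Stands in for the synchronized, expensive getNumLiveDataNodes() call.
    private final IntSupplier numLiveDataNodes;

    ReplicationCheckSketch(int minReplication, IntSupplier numLiveDataNodes) {
        this.minReplication = minReplication;
        this.numLiveDataNodes = numLiveDataNodes;
    }

    /** Assumed current behavior: always pays for the live DataNode count. */
    boolean isSufficientlyReplicatedEager(int liveReplicas) {
        int required = Math.min(minReplication, numLiveDataNodes.getAsInt());
        return liveReplicas >= required;
    }

    /** Sketch of the optimization: consult the live count only when needed. */
    boolean isSufficientlyReplicatedLazy(int liveReplicas) {
        if (liveReplicas >= minReplication) {
            // Common case: minReplication is already met, so the size of
            // the cluster cannot change the answer.
            return true;
        }
        // Rare case: fewer replicas than minReplication. This is still
        // sufficient if the cluster has no more live DataNodes than replicas.
        return liveReplicas >= numLiveDataNodes.getAsInt();
    }
}
```

Both variants return the same result, because liveReplicas >= min(minReplication, numLive) holds exactly when liveReplicas >= minReplication or liveReplicas >= numLive. The lazy variant simply skips the synchronized call whenever the first condition is already true, which should be the usual case on a large cluster.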