Thanks Xiaoyu Yao. I posted the v5 patch to address your comments #1, 2 and 3.
Tracking only the latency of sending packets to the last node in the pipeline is a conscious design choice.
In the case of pipeline [dn0, dn1, dn2] with 5ms latency from dn0 to dn1 and 100ms from dn1 to dn2, the NameNode flags dn2 as slow since it sees 100ms latency to dn2. Note that the NameNode is not aware of the pipeline structure in this context and only sees the latency between two DataNodes.
In another case of the same pipeline, with 100ms latency from dn0 to dn1 and 5ms from dn1 to dn2, the NameNode will miss that dn1 is slow since it is not the last node. However, the assumption is that in a busy enough cluster there are many other pipelines where dn1 is the last node, e.g. [dn3, dn4, dn1]. Also, our tracking interval is long (at least an hour), which improves the chances of a bad DataNode being the last node in multiple pipelines.
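
To make the two cases concrete, here is a rough standalone sketch of last-hop-only tracking. It is not code from the patch: the class name, method names and the fixed average-latency threshold are all hypothetical and are there only to illustrate the reasoning above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the classes used in the patch.
class LastHopLatencySketch {
    // Latencies (ms) seen when sending to a node that was LAST in a pipeline.
    private final Map<String, List<Long>> samples = new HashMap<>();

    // Record only the final hop of the pipeline (second-to-last -> last node).
    // hopLatenciesMs[i] is the latency from pipeline[i] to pipeline[i + 1].
    void recordPipeline(List<String> pipeline, long[] hopLatenciesMs) {
        if (pipeline.size() < 2 || hopLatenciesMs.length != pipeline.size() - 1) {
            throw new IllegalArgumentException("need exactly one latency per hop");
        }
        String lastNode = pipeline.get(pipeline.size() - 1);
        long lastHopMs = hopLatenciesMs[hopLatenciesMs.length - 1];
        samples.computeIfAbsent(lastNode, k -> new ArrayList<>()).add(lastHopMs);
    }

    // Nodes whose average last-hop latency exceeds thresholdMs (assumed rule),
    // returned in sorted order.
    List<String> slowNodes(long thresholdMs) {
        List<String> slow = new ArrayList<>();
        for (Map.Entry<String, List<Long>> e : samples.entrySet()) {
            double avg = e.getValue().stream()
                .mapToLong(Long::longValue).average().orElse(0.0);
            if (avg > thresholdMs) {
                slow.add(e.getKey());
            }
        }
        Collections.sort(slow);
        return slow;
    }

    public static void main(String[] args) {
        LastHopLatencySketch tracker = new LastHopLatencySketch();
        // Second case above: dn1 is slow but is not the last node, so it is missed.
        tracker.recordPipeline(Arrays.asList("dn0", "dn1", "dn2"), new long[] {100, 5});
        System.out.println(tracker.slowNodes(50));
        // Another pipeline where dn1 is last: now it shows up.
        tracker.recordPipeline(Arrays.asList("dn3", "dn4", "dn1"), new long[] {5, 100});
        System.out.println(tracker.slowNodes(50));
    }
}
```

With only the first pipeline recorded nothing is flagged; once a pipeline ending in dn1 is observed within the same tracking window, dn1 is reported. A longer window simply gives more such pipelines a chance to occur.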