Combing through logs of a cluster recovering from tserver failure and reading the tablet copy code revealed that if a tserver's tablet copy threadpool is full (which defaults to 10 slots), and a duplicate tablet copy request is received from a leader, the tablet server will respond with a THROTTLED error instead of an ALREADY_INPROGRESS error. If this situation continues for 300 seconds, the leader will eject the replica, which has the effect of starting anothe tablet copy on another tserver. Meanwhile, the original tablet copy may still be making progress in the first follower tserver.
As a result, a single tablet can be in the process of being copied multiple times if a cluster is undergoing many tablet copies, for instance if a tserver with hundreds of tablets fails.
It's expected that a cluster which is recovering from tserver failure would have roughly equal aggregate read and write IO throughput, however because of this bug, the recovering cluster exhibited significantly more write disk throughput than read, since many tablet copies were served from cache. Attached is a representative graph. The spike around 5:07 is when fsyncs were disabled cluster wide (fsync tablet copy overhead is being tracked in
Thanks to Adar and Todd for helping me track this down.