After adding ksck checksum scanners to YCSB, we noticed that 10% of the jobs die because a single RPC timed out. The jobs run after the "ycsb run" phase, when there's usually a large compaction backlog, so those long checksum scans happen concurrently with writes. Not only are RPCs slowed down, but compactions also go from completing in 4 seconds to taking up to 3 minutes.
Todd suspects a "fairly well known behavior where a constant read workload can delay ext4 checkpoints from happening" and pointed to http://www.spinics.net/lists/linux-ext4/msg25761.html and https://lkml.org/lkml/2011/6/9/647.
It would be nice to get a better understanding of where we're blocking. If we're starving writes, why are reads also blocked? Adding more tracing would help.
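As a starting point for that tracing, here is a minimal sketch of the kind of timing wrapper that could go around individual read, write, and fsync calls to show which ones stall. It is illustrative only and not Kudu code: `trace_slow`, its parameters, and the `sink` list are hypothetical names, and the real server would more likely use its own tracing facilities.

```python
import time
from contextlib import contextmanager


@contextmanager
def trace_slow(op_name, threshold_s, sink, clock=time.monotonic):
    """Record (op_name, elapsed_seconds) in `sink` if the wrapped block is slow.

    Hypothetical helper: `sink` is any list-like object with append(), and
    `clock` is injectable so the helper can be exercised without real delays.
    """
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        # Only keep operations at or above the threshold so that the
        # output stays readable under a heavy workload.
        if elapsed >= threshold_s:
            sink.append((op_name, elapsed))


if __name__ == "__main__":
    slow_ops = []
    # Stand-in for a blocking I/O call; a real run would wrap read(),
    # write(), or fsync() on the data directory instead.
    with trace_slow("example_sleep", threshold_s=0.01, sink=slow_ops):
        time.sleep(0.05)
    for name, secs in slow_ops:
        print(f"{name} took {secs:.3f}s")
```

Tagging reads and writes with different `op_name` values and comparing the recorded durations would show whether both stall at the same moments, which is the open question here.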