I'm doing some stress/failure testing on a cluster with lots of tablets and ran into the following mess:
- TSTabletManager::GenerateIncrementalTabletReport is holding the TSTabletManager lock in 'read' mode
- it's calling CreateReportedTabletPB on a bunch of tablets which are in the process of an election storm
- each such call blocks in RaftConsensus::ConsensusState, since that tablet is in the middle of fsyncing its consensus metadata to disk
- thus the read lock on the TSTabletManager lock is held for a long time (many seconds, if not tens of seconds)
- meanwhile, some other thread is trying to take the TSTabletManager lock for write, and is blocked behind the reader above
- rw_spinlock is write-preferring (writer-starvation-free), which means that once a writer is waiting, no new readers can acquire the lock
What's worse, rw_spinlock is a true spin lock, so there are now tens of threads stuck in a 'while (true) sched_yield()' loop, generating over 1.5M context switches per second.