On a heavily loaded cluster, the WAL count keeps rising and we can get into a state where we are not rolling the logs off fast enough. In particular, there is an interesting state at the extreme where we pick a region to flush because of 'Too many WALs' but the region is actually not online. As the WAL count rises, we keep picking a region-to-flush that is no longer on the server. This condition blocks our being able to clear WALs; eventually WALs climb into the hundreds and the RS goes zombie with a full Call queue that starts throwing CallQueueTooLargeExceptions (bad if this server is the one carrying hbase:meta): i.e. clients fail to access the RegionServer.
One symptom is a fast spike in the WAL count on the RS. A restart of the RS breaks the bind.
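The stuck loop described above can be sketched as a toy simulation. This is hypothetical illustration code, not actual HBase internals; the class and method names are made up. It models a roller that asks the region pinning the oldest WAL to flush: if that region is no longer online on the server, a naive roller makes no progress and the WAL backlog only grows, whereas skipping regions that are not online lets the old WALs archive.

```java
import java.util.*;

// Toy model (hypothetical, not HBase code): each WAL in the backlog is
// "pinned" by one region whose unflushed edits it holds. Rolling the
// oldest WAL off requires flushing that region first.
public class WalRollSim {
    // Returns the WAL count left after attempting to roll the backlog.
    // walPins: region pinning each WAL, oldest first.
    // online:  regions still hosted on this server.
    // skipOffline: if true, WALs pinned by not-online regions are
    //              released anyway instead of blocking the roll.
    static int rollWals(List<String> walPins, Set<String> online,
                        boolean skipOffline) {
        List<String> wals = new ArrayList<>(walPins);
        while (!wals.isEmpty()) {
            String pinned = wals.get(0);        // region pinning oldest WAL
            if (online.contains(pinned)) {
                wals.remove(0);                 // flush succeeds, WAL archives
            } else if (skipOffline) {
                wals.remove(0);                 // region gone: release anyway
            } else {
                break;                          // stuck: flush target not online,
                                                // backlog can only grow from here
            }
        }
        return wals.size();
    }

    public static void main(String[] args) {
        Set<String> online = new HashSet<>(Arrays.asList("r2", "r3"));
        // r1 pins the oldest WAL but has moved off this server.
        List<String> pins = Arrays.asList("r1", "r2", "r3");
        System.out.println(rollWals(pins, online, false)); // prints 3: stuck
        System.out.println(rollWals(pins, online, true));  // prints 0: cleared
    }
}
```

The point of the sketch is only the shape of the bug: the roller keys its progress on flushing one chosen region, so a single offline choice pins every newer WAL behind it.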
Here is how it looks in the log:
I've seen these runaway WALs on 2.2.1. I've also regularly seen runaway WALs on a 1.2.x version that had the HBASE-16721 fix in it, but can't say yet if it was for the same reason as above.