While testing the write capacity of a 4 machine hbase cluster we were getting long and frequent client pauses as we attempted to load the data. Looking into the problem we'd get a relatively large compaction queue and when a region hit the "hbase.hstore.blockingStoreFiles" limit it would get block the client and the compaction request would get put on the back of the queue waiting for many other less important compactions. The client is basically stuck at that point until a compaction is done. Prioritizing the compaction requests and allowing the request that is blocking other actions go first would help solve the problem.
You can see the problem by looking at our log files:
You'll first see an event such as a too many HLog which will put a lot of requests on the compaction queue.
Then you see the too many log files:
Which leads to this:
Once the compaction / split is done a flush is able to happen which unblocks the IPC allowing writes to continue. Unfortunately this process can take upwards of 15+ minutes (the specific case shown here from our logs took about 4 minutes).