We have run into an issue in a mid-sized cluster with secondary indexes. The problem is that all write handlers were blocked waiting on a single scan from the secondary index to complete for more than 5 minutes, causing all incoming RPCs to time out, which led to write unavailability and further problems (disabling the index, etc.). We took jstack outputs continuously from the servers to understand what was going on.
In the jstack outputs from that particular server, we can see three types of stacks (these are raw jstack dumps, so the thread names are unfortunately missing).
- First, there are a lot of threads waiting for previously started MVCC transactions to complete:
The way MVCC works is that it assumes transactions are short-lived, and it guarantees that transactions are committed in strict serial order. Transactions in this case are write requests coming in and being executed from handlers. Each handler starts a transaction, gets an MVCC write number (the MVCC transaction number), and does the WAL append + memstore append. Then it marks the MVCC transaction complete, and before returning to the user, we have to guarantee that the transaction is visible. So we wait for the MVCC read point to advance beyond our own write transaction number. This is done at the above stack trace (waitForPreviousTransactionsComplete). A lot of threads with this stack means that one or more handlers have started an MVCC transaction but did not finish the work, and thus did not complete their transactions. The MVCC read point can only advance serially; otherwise ongoing transactions would become visible, so we just wait for the previously started transactions to complete.
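To make the serial-visibility constraint concrete, here is a minimal sketch in plain Java. It is not HBase's actual MultiVersionConcurrencyControl; the class and method names (MiniMvcc, begin, complete, waitForRead) are hypothetical stand-ins that only mirror the mechanism described above: the read point cannot pass a write number until every earlier write number has completed, so one slow transaction stalls everyone behind it.

```java
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical mini-MVCC illustrating serial read-point advancement.
class MiniMvcc {
    private long nextWriteNumber = 1;  // next write transaction number to hand out
    private long readPoint = 0;        // highest write number visible to readers
    private final ConcurrentSkipListSet<Long> inFlight = new ConcurrentSkipListSet<>();

    // Begin a write transaction: assign the next serial write number.
    synchronized long begin() {
        long wn = nextWriteNumber++;
        inFlight.add(wn);
        return wn;
    }

    // Mark a transaction complete. The read point advances only up to just
    // below the smallest still-in-flight write number, so completing out of
    // order leaves the read point stuck behind the slow transaction.
    synchronized void complete(long wn) {
        inFlight.remove(wn);
        long candidate = inFlight.isEmpty() ? nextWriteNumber - 1 : inFlight.first() - 1;
        readPoint = Math.max(readPoint, candidate);
        notifyAll();
    }

    // What waitForPreviousTransactionsComplete amounts to: block until our
    // own write number is visible, i.e. every earlier transaction finished.
    synchronized void waitForRead(long wn) throws InterruptedException {
        while (readPoint < wn) {
            wait();
        }
    }

    synchronized long readPoint() { return readPoint; }
}

public class MvccDemo {
    public static void main(String[] args) {
        MiniMvcc mvcc = new MiniMvcc();
        long t1 = mvcc.begin();  // the slow handler doing the index scan
        long t2 = mvcc.begin();  // a fast handler that finishes immediately
        mvcc.complete(t2);
        // t2 is done, but the read point cannot pass t1: visibility is serial.
        System.out.println("readPoint after completing t2: " + mvcc.readPoint()); // 0
        mvcc.complete(t1);
        System.out.println("readPoint after completing t1: " + mvcc.readPoint()); // 2
    }
}
```

This is why a pile-up of threads in waitForPreviousTransactionsComplete points at one handler that began a transaction and never completed it.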
- Second, a set of threads are waiting with this stack trace:
These are the threads that want to write something but have to wait for the row lock(s) held by other threads. It is very likely that the threads in the previous set are holding the actual row locks that this set of threads wants to acquire. This is normal behavior, nothing concerning. However, if a handler cannot acquire the row lock within 60 seconds, it will throw an exception, and the RPC will be retried.
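The timed row-lock wait can be sketched like this. A Semaphore stands in for HBase's per-row lock (the real path is inside HRegion, and the 60-second default comes from hbase.rowlock.wait.duration); the method name lockRow and the 100 ms timeout used in the demo are illustrative only.

```java
import java.io.IOException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a handler's bounded wait for a row lock.
public class RowLockWait {
    // Try to take the row lock; give up with an exception after the timeout,
    // which surfaces as a failed RPC that the client retries.
    static void lockRow(Semaphore rowLock, long timeoutMs) throws IOException {
        try {
            if (!rowLock.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new IOException("Timed out waiting for row lock");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted waiting for row lock", e);
        }
    }

    public static void main(String[] args) {
        Semaphore rowLock = new Semaphore(1);
        rowLock.acquireUninterruptibly(); // the slow handler holds the row lock
        try {
            lockRow(rowLock, 100);        // a second handler times out (100 ms here, 60 s in HBase)
            System.out.println("acquired");
        } catch (IOException e) {
            System.out.println("lock attempt failed: " + e.getMessage());
        }
    }
}
```

The key point for this incident: the waiters are harmless in themselves, but once the lock holder is stuck for minutes, every waiter eventually times out and the client retries, multiplying the load.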
- Finally, the only thread in the third set is doing a scan for the index update:
This thread is doing a "scan" to be able to compute the index updates (note HRegion.doMiniBatchMutation -> IndexBuildManager.getIndexUpdate -> Region Scan).
Normally, this scan should finish in a short interval, but for whatever reason it takes more than 5 minutes. We know this because in the jstacks from epoch 1462985796 to epoch 1462985958, we see the exact same thread (Thread 15276) continuing the scan.
So, we do a skip scan which takes >5 minutes while holding some row locks and keeping the MVCC write transaction open. All the other handlers that start write transactions after this also hold some other row locks and wait for the handler doing the scan to finish. The other handlers are just waiting for some of the same row locks. So in effect, everybody is waiting on one handler thread to finish the scan.
So, why are we doing a scan in the index builder code, and why is it taking more than 5 minutes?
Apparently, we are doing a skip scan in PhoenixIndexBuilder:
I think that in some cases, this skip scan turns into an expensive, long scan (instead of a point-lookup scan). However, there is no easy way to debug the actual rows and the schema from the stack traces alone.
The comments say that it is done to bring the results into the block cache for the gets that follow. However, there should be no need for that at all, since the reads themselves will bring those results into the block cache. The only possible justification would be prefetching via the skip scan, but even then it is still not correct to do double work.
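The point-lookup-versus-long-scan distinction can be illustrated with a toy model. This is not Phoenix's SkipScanFilter; a TreeMap stands in for the sorted store, and the two methods below are hypothetical. A well-behaved skip scan seeks directly to each wanted key (a handful of reads); when it degenerates, it effectively walks every row in the range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative sketch: point seeks vs. a degenerate full-range walk.
public class SkipScanSketch {
    // Skip-scan-like behavior: one ordered seek per wanted key.
    static List<String> skipScan(NavigableMap<String, String> store, List<String> keys) {
        List<String> out = new ArrayList<>();
        for (String k : keys) {
            String hit = store.get(k);          // direct seek to the key
            if (hit != null) out.add(hit);
        }
        System.out.println("skip scan touched " + keys.size() + " keys");
        return out;
    }

    // Degenerate case: read every row in the range and filter.
    static List<String> rangeScan(NavigableMap<String, String> store, List<String> keys) {
        List<String> out = new ArrayList<>();
        int touched = 0;
        for (Map.Entry<String, String> e : store.entrySet()) { // every row is read
            touched++;
            if (keys.contains(e.getKey())) out.add(e.getValue());
        }
        System.out.println("range scan touched " + touched + " keys");
        return out;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> store = new TreeMap<>();
        for (int i = 0; i < 1000; i++) store.put(String.format("row%04d", i), "v" + i);
        List<String> wanted = List.of("row0001", "row0500", "row0999");
        // Same results either way; the work done differs by orders of magnitude.
        assert skipScan(store, wanted).equals(rangeScan(store, wanted));
    }
}
```

If the pre-scan in PhoenixIndexBuilder hits the degenerate case while holding row locks and an open MVCC transaction, the cost difference above is exactly what turns a normal write path into the multi-minute stall described earlier.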
The other thing is that we execute the getIndexUpdate() calls in parallel later, but the skip scan itself is still processed serially.