Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
Description
(disclaimer: still performing due diligence on this one)
I've been helping a user this week with what is thought to be a race condition in secondary index updates. The user has a relatively write-heavy workload and a few tables that each have at least one index; a sketch of that setup follows.
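For concreteness, here is a minimal sketch of the kind of schema and write pattern in play: a Phoenix data table with a global secondary index under a steady stream of upserts. The table, column, and quorum names are hypothetical, and it assumes the Phoenix JDBC driver is on the classpath; every upsert to the data table also drives an index update, which is the code path that fails below.
{code:java}
// Hypothetical schema and write loop; names and the ZK quorum are made up.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IndexedWriteLoad {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
      conn.createStatement().execute(
          "CREATE TABLE IF NOT EXISTS EVENTS (ID BIGINT PRIMARY KEY, USER_ID VARCHAR, TS DATE)");
      // The secondary index: every upsert to EVENTS also writes to it.
      conn.createStatement().execute(
          "CREATE INDEX IF NOT EXISTS EVENTS_BY_USER ON EVENTS (USER_ID)");
      try (PreparedStatement ps =
               conn.prepareStatement("UPSERT INTO EVENTS VALUES (?, ?, CURRENT_DATE())")) {
        for (long i = 0; i < 1_000_000; i++) {
          ps.setLong(1, i);
          ps.setString(2, "user-" + (i % 1000));
          ps.execute();
          if (i % 1000 == 0) {
            conn.commit(); // Phoenix connections default to autoCommit=false
          }
        }
        conn.commit();
      }
    }
  }
}
{code}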
When the region distribution is changing (concretely, we were doing a rolling restart of the cluster with the load balancer left enabled, in the hopes of retaining as much availability as possible), I've seen the following general sequence in the logs:
- An index update fails with ERROR 2008 (INT10): the index metadata cache expired or is simply missing
- The index is taken offline to be asynchronously rebuilt
- A flush on the data table's region is queued for quite some time
- The RegionServer is asked to close a region (commonly due to a move)
- The RegionServer aborts because the memstore for the data table's region is in an inconsistent state, e.g.: Assertion failed while closing store <region> <colfam> flushableSize expected=0, actual= 193392. Current memstoreSize=-552208. Maybe a coprocessor operation failed and left the memstore in a partially updated state. (see the sketch after this list)
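To make that last step concrete, below is a minimal, self-contained sketch of the suspected accounting mismatch. This is not HBase source and all class and field names are illustrative (HBASE-15837, linked below, tracks the real code path): the store-level flushable size and the region-level memstore size are adjusted at different points, so an exception thrown from the postBatchMutate() coprocessor hook can leave the two counters disagreeing, which is what the close-time assertion trips on.
{code:java}
// Illustrative sketch only -- not HBase source.
import java.util.ArrayList;
import java.util.List;

public class MemstoreAccountingSketch {

  /** Stand-in for a store's memstore: holds cells and tracks their size. */
  static class Store {
    final List<byte[]> cells = new ArrayList<>();
    long flushableSize; // the value the close-time assertion expects to be 0

    void add(byte[] cell) {
      cells.add(cell);
      flushableSize += cell.length;
    }
  }

  /** Stand-in for a coprocessor hook that can fail mid-batch. */
  interface Coprocessor {
    void postBatchMutate() throws Exception;
  }

  final Store store = new Store();
  long regionMemstoreSize; // region-level counter used for flush decisions

  void batchMutate(byte[] cell, Coprocessor cp) {
    // 1. The cell lands in the store and both counters are bumped
    //    *before* the coprocessor hook runs.
    store.add(cell);
    regionMemstoreSize += cell.length;
    try {
      // 2. The hook throws, e.g. the index writer hitting ERROR 2008.
      cp.postBatchMutate();
    } catch (Exception e) {
      // 3. The failure path rolls back the region-level counter but not the
      //    store's flushableSize -- the mismatch being illustrated.
      regionMemstoreSize -= cell.length;
    }
  }

  public static void main(String[] args) {
    MemstoreAccountingSketch region = new MemstoreAccountingSketch();
    region.batchMutate(new byte[193392],
        () -> { throw new Exception("ERROR 2008 (INT10): index metadata cache expired"); });
    System.out.printf("flushableSize expected=0, actual=%d; memstoreSize=%d%n",
        region.store.flushableSize, region.regionMemstoreSize);
  }
}
{code}
In a sketch like this, a subsequent flush that subtracts the store's actual size from the region-level counter would drive the latter negative, consistent with the Current memstoreSize=-552208 in the abort message above.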
Some relevant HBase issues include HBASE-10514 and HBASE-10844.
I've been talking to ayingshu and devaraj about it, but haven't found anything conclusive yet. I will dump findings here.
Issue Links
- relates to HBASE-15837: Memstore size accounting is wrong if postBatchMutate() throws exception (Resolved)