[PHOENIX-2883] Region close during automatic disabling of index for rebuilding can lead to RS abort - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

(disclaimer: still performing due-diligence on this one)

I've been helping a user this week with what is thought to be a race condition in secondary index updates. This user has a relatively write-heavy workload with a few tables that each have at least one index.

When the region distribution is changing (concretely, we were doing a rolling restart of the cluster without the load balancer disabled, in the hopes of retaining as much availability as possible), the logs show the following general outline:

1. An index update fails (due to ERROR 2008 (INT10): the index metadata cache expired or is just missing)
2. The index is taken offline to be asynchronously rebuilt
3. A flush on the data table's region is queued for quite some time
4. RS is asked to close a region (commonly due to a move)
5. RS aborts because the memstore for the data table's region is in an inconsistent state (e.g. Assertion failed while closing store <region> <colfam> flushableSize expected=0, actual= 193392. Current memstoreSize=-552208. Maybe a coprocessor operation failed and left the memstore in a partially updated state.)
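To make the suspected failure mode easier to reason about, here is a toy model of how a size counter can drift from what a store actually holds when a hook throws partway through a write. It is only a sketch: `ToyRegion`, `HookFailed`, `failing_hook`, and the byte sizes are all hypothetical, none of this is HBase or Phoenix code, and the cause of the abort described here has not been confirmed. It is merely consistent with the negative memstoreSize in the log line above.

```python
class HookFailed(Exception):
    """Hypothetical stand-in for a coprocessor hook throwing mid-write."""


class ToyRegion:
    """Toy model only; not the HBase HRegion implementation."""

    def __init__(self):
        self.store_bytes = 0      # bytes actually held in the store
        self.accounted_bytes = 0  # region-level size counter

    def write(self, size, post_hook=None):
        self.store_bytes += size      # the edit lands in the store first
        if post_hook is not None:
            post_hook()               # may raise before accounting happens
        self.accounted_bytes += size  # skipped entirely if the hook raised

    def flush(self):
        flushed = self.store_bytes
        self.store_bytes = 0
        self.accounted_bytes -= flushed  # subtracts bytes never added
        return flushed


def failing_hook():
    # Assumed failure, standing in for a failed index update.
    raise HookFailed("index update failed")


def demo():
    region = ToyRegion()
    region.write(100)
    try:
        region.write(50, post_hook=failing_hook)
    except HookFailed:
        pass  # the caller sees the failure, but the store kept the edit
    before = (region.store_bytes, region.accounted_bytes)
    region.flush()
    after = (region.store_bytes, region.accounted_bytes)
    return before, after


if __name__ == "__main__":
    print(demo())
```

In this model the counter trails the store by the size of the failed write, so a later flush drives it below zero. A close-time sanity check that compares the two would then trip, which is the general shape of the assertion quoted above.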

Some relevant HBase issues include ~~HBASE-10514~~ and ~~HBASE-10844~~.

Have been talking to ayingshu and devaraj about it, but haven't found anything conclusive yet. Will dump findings here.

Issue Links

relates to

HBASE-15837 Memstore size accounting is wrong if postBatchMutate() throws exception (Resolved)

People

Assignee: Josh Elser

Reporter: Josh Elser

Votes: 0

Watchers: 6

Dates

Created: 06/May/16 22:00

Updated: 07/Feb/18 07:54

Resolved: 16/May/16 19:05