Phoenix / PHOENIX-2883

Region close during automatic disabling of index for rebuilding can lead to RS abort



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete


      (disclaimer: still performing due-diligence on this one)

      I've been helping a user this week with what is thought to be a race condition in secondary index updates. This user has a relatively write-heavy workload with a few tables that each have at least one index.

      When the region distribution is changing (concretely, we were doing a rolling restart of the cluster without disabling the load balancer, in the hope of retaining as much availability as possible), I've seen the following general sequence in the logs:

      • An index update fails with ERROR 2008 (INT10): the index metadata cache entry has expired or is missing
      • The index is taken offline to be asynchronously rebuilt
      • A flush on the data table's region sits in the queue for quite some time
      • The RS is asked to close a region (commonly due to a move)
      • The RS aborts because the memstore for the data table's region is in an inconsistent state (e.g. "Assertion failed while closing store <region> <colfam> flushableSize expected=0, actual= 193392. Current memstoreSize=-552208. Maybe a coprocessor operation failed and left the memstore in a partially updated state.")
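      To make the last step concrete, here is a toy model of the kind of accounting check that the quoted assertion message describes. It is not HBase code and not a diagnosis of this issue: the class, its fields, and the `buggy_rollback` failure path are all hypothetical. It only shows how a store's contents and a region-level size counter, which should move in lockstep, can diverge when a failure path updates just one of them.

```python
class ToyRegion:
    """Toy single-store region. Hypothetical names; NOT HBase code."""

    def __init__(self):
        self.store_bytes = 0     # bytes actually held in the store's memstore
        self.memstore_size = 0   # region-level counter tracked alongside it

    def apply(self, nbytes):
        # Normal write path: store contents and counter move together.
        self.store_bytes += nbytes
        self.memstore_size += nbytes

    def buggy_rollback(self, nbytes):
        # ASSUMED failure path for illustration: the counter is rolled
        # back, but the cells stay in the store, so the values diverge.
        self.memstore_size -= nbytes

    def flush(self):
        # A flush drains the store and subtracts what it drained.
        flushed = self.store_bytes
        self.store_bytes = 0
        self.memstore_size -= flushed
        return flushed

    def close(self):
        # Close-time sanity check, loosely modelled on the quoted message.
        if self.store_bytes != 0 or self.memstore_size != 0:
            raise RuntimeError(
                "inconsistent memstore: flushableSize expected=0, "
                f"actual={self.store_bytes}; memstoreSize={self.memstore_size}"
            )


region = ToyRegion()
region.apply(100)           # data table write lands in the memstore
region.buggy_rollback(100)  # hypothetical partial rollback after a failure
region.flush()              # the delayed flush finally runs
try:
    region.close()          # region close triggered by a move
except RuntimeError as err:
    print(err)
```

      In this sketch the counter ends up negative after the flush, because the flush subtracts bytes that the rollback had already subtracted. Whether anything like this happens in the real write path is exactly what is still under investigation.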

      Some relevant HBase issues include HBASE-10514 and HBASE-10844.

      I have been talking to ayingshu and devaraj about it, but we haven't found anything conclusive yet. I will dump findings here.


              elserj Josh Elser