HBase / HBASE-26371

Prioritize meta region move over other region moves in region_mover

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 2.5.0, 3.0.0-alpha-2, 1.7.2, 2.4.8, 2.3.8
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      We have seen a few issues in production where moving the meta region from one server to another took some time and, in the meantime, regions of other system tables hosted on the same server were moved simultaneously. When those non-meta system regions (e.g. the namespace region) came online on their new servers, the new servers could not write the info:sn update for the new location to the meta table. Around the same time, the active master was also bounced. The new active master reads the namespace region's location from the meta table and treats it as final. Hence, even if the namespace region is already online on a different host, the stale info:sn value prevents the master from initializing, because it keeps waiting for the namespace region to become available on the old regionserver. In this case, we need to make a special arrangement to bring the namespace region online on the old server only.

      2021-10-12 20:00:00,630 INFO [1f507eff84ef336f1250] regionserver.HRegionServer - Post open deploy tasks for hbase:namespace,1626899414773.52693312958f1f507eff84ef336f1250.
      2021-10-12 20:04:18,622 INFO [1f507eff84ef336f1250] hbase.MetaTableAccessor - Updated row hbase:namespace,1626899414773.52693312958f1f507eff84ef336f1250. with server=server-0,60020,1633467603387
      2021-10-12 20:04:18,622 INFO [1f507eff84ef336f1250] client.AsyncProcess - #27, waiting for some tasks to finish. Expected max=0, tasksInProgress=4 hasError=false, tableName=hbase:meta
      2021-10-12 20:04:18,622 INFO [1f507eff84ef336f1250] client.AsyncProcess - Left over 4 task(s) are processed on server(s): []
      2021-10-12 20:04:18,622 DEBUG [1f507eff84ef336f1250] regionserver.HRegionServer - Finished post open deploy task for hbase:namespace,1626899414773.52693312958f1f507eff84ef336f1250.
      
      

      As with the namespace region, other user or system table regions hosted on the same server as meta have also ended up with inconsistent state updates, specifically when the meta region is moving and the active master is restarted around the same time. Once the active master comes online, we have to fix such inconsistencies with hbck.

      On the other hand, there have been some enhancements, made as part of ZK-less region assignment, that remove the requirement to colocate the meta region with the active master, e.g. HBASE-11610.

      We have not yet enabled ZK-less region assignment entirely; only the migration config, hbase.assignment.usezk.migrating, is enabled. With this config, we expect the active master to perform an additional write to the meta table for the updated region state (in addition to updating the in-memory RIT map in RegionStates). We have seen hangs here as well: if the meta region is in transition (not available) while the region mover is moving other non-meta regions at the same time, the active master cannot complete the meta update, which further delays the intermediate state transitions driven by ZK watcher updates.

      client.AsyncProcess - #3, waiting for 1  actions to finish on table: hbase:meta
      

      If we take a step back and think about these issues, all of them are associated with graceful start/stop of regionservers. The region mover tries to move all regions of the given server in parallel using a user-configurable thread pool, and hence gives no preference to meta.

      On the other hand, after trying to reproduce this inconsistent region state behaviour with non-graceful start/stop, I realized that we don't face such issues there, because ServerCrashProcedure (SCP) always prioritizes the meta region's availability over any other region if the server being processed by the SCP was hosting the meta region. This is exactly what region_mover should also provide. Given that every non-meta region's location is stored in the meta table, the meta region must always be moved first, and only after it comes online can the other regions be moved in parallel using the configured thread pool.
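
      The proposed ordering can be illustrated with a minimal, self-contained sketch. This is not the actual region_mover code: the class MetaFirstMover, the RegionMoveAction callback, and the use of plain region-name strings are all hypothetical, chosen only to show the two phases (meta moved synchronously first, everything else moved in parallel afterwards). It assumes the callback returns only once the region is online at its destination.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; none of these names exist in HBase.
class MetaFirstMover {

    // Region names of the meta table start with the table name followed by a comma.
    static final String META_PREFIX = "hbase:meta,";

    // Hypothetical callback: assumed to block until the region is online on the target.
    interface RegionMoveAction {
        void move(String regionName) throws Exception;
    }

    static void moveMetaFirst(List<String> regionNames, int maxThreads,
                              RegionMoveAction action) throws Exception {
        List<String> metaRegions = new ArrayList<>();
        List<String> otherRegions = new ArrayList<>();
        for (String name : regionNames) {
            if (name.startsWith(META_PREFIX)) {
                metaRegions.add(name);
            } else {
                otherRegions.add(name);
            }
        }

        // Phase 1: move meta on the calling thread and wait for it to finish,
        // so that later region moves can record their new location in meta.
        for (String name : metaRegions) {
            action.move(name);
        }

        // Phase 2: move all remaining regions in parallel on a bounded pool.
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<Void>> pending = new ArrayList<>();
            for (String name : otherRegions) {
                Callable<Void> task = () -> {
                    action.move(name);
                    return null;
                };
                pending.add(pool.submit(task));
            }
            // Propagate the first failure, if any, to the caller.
            for (Future<Void> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

      Calling moveMetaFirst(regionsOnServer, 8, mover) with any mix of regions invokes the callback for the meta region before any other region is submitted to the pool. The order among the non-meta regions remains unspecified, as it is with the existing parallel move.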

          People

            Assignee: Viraj Jasani (vjasani)
            Reporter: Viraj Jasani (vjasani)