During Master initialization we
- restore existing procedures that still need to run from prior active Master instances
- look for signs that Region Servers have died and need to be recovered while we were out and schedule a ServerCrashProcedure (SCP) for each them
- turn on the assignment manager
The normal turn of events for a ServerCrashProcedure will attempt to use a bulk assignment to maintain the set of regions on a RS if possible. However, we wait around and retry a bit later if the assignment manager isn’t ready yet.
Note that currently #2 has no notion of wether or not a previous active Master instances has already done a check. This means we might schedule an SCP for a ServerName (host, port, start code) that already has an SCP scheduled. Ideally, such a duplicate should be a no-op.
However, before step #2 schedules the SCP it first marks the region server as dead and not yet processed, with the expectation that the SCP it just created will look if there is log splitting work and then mark the server as easy for region assignment. At the same time, any restored SCPs that are past the step of log splitting will be waiting for the AssignmentManager still. As a part of restoring themselves, they do not update with the current master instance to show that they are past the point of WAL processing.
Once the AssignmentManager starts in #3 the restored SCP continues; it will eventually get to the assignment phase and find that its server is marked as dead and in need of wal processing. Such assignments are skipped with a log message. Thus as we iterate over the regions to assign we’ll skip all of them. This non-intuitively shifts the “no-op” status from the newer SCP we scheduled at #2 to the older SCP that was restored in #1.
Bulk assignment works by sending the assign calls via a pool to allow more parallelism. Once we’ve set up the pool we just wait to see if the region state updates to online. Unfortunately, since all of the assigns got skipped, we’ll never change the state for any of these regions. That means the bulk assign, and the older SCP that started it, will wait until it hits a timeout.
By default the timeout for a bulk assignment is the smaller of (# Regions in the plan * 10s) or (# Regions in the most loaded RS in the plan * 1s + 60s + # of RegionServers in the cluster * 30s). For even modest clusters with several hundreds of regions per region server, this means the “no-op” SCP will end up waiting ~tens-of-minutes (e.g. ~50 minutes for an average region density of 300 regions per region server on a 100 node cluster. ~11 minutes for 300 regions per region server on a 10 node cluster). During this time, the SCP will hold one of the available procedure execution slots for both the overall pool and for the specific server queue.
As previously mentioned, restored SCPs will retry their submission if the assignment manager has not yet been activated (done in #3), this can cause them to be scheduled after the newer SCPs (created in #2). Thus the order of execution of no-op and usable SCPs can vary from run-to-run of master initialization.
This means that unless you get lucky with SCP ordering, impacted regions will remain as RIT for an extended period of time. If you get particularly unlucky and a critical system table is included in the regions that are being recovered, then master initialization itself will end up blocked on this sequence of SCP timeouts. If there are enough of them to exceed the master initialization timeouts, then the situation can be self-sustaining as additional master fails over cause even more duplicative SCPs to be scheduled.
- Master appears to hang; failing to assign regions to available region servers.
- Master appears to hang during initialization; shows waiting for the meta or namespace regions.
- Repeated master restarts allow some progress to be made on assignments for a limited period of time.
- Master UI shows a large number of Server Crash Procedures in RUNNABLE state and the number increases by roughly the number of Region Servers on master restart.
- Master log shows a large number of messages that assignment of a region has failed because it was last seen on a region server that has not yet been processed. These messages come from the AssignmentManager logger. This message should normally only occur when a Region Server dies just before some assignment is about to happen. When this combination of issues happens the message will happen repeatedly; every time a new defunct SCP is processed it’ll happen for each region.
Example of aforementioned message:
The procedure we currently have to reproduce this issue requires specific timings that can be hard to get correctly, so this might require multiple tries.
Before starting, the test cluster should have the following properties:
- Only 1 master. Not a strict requirement but it helps in the following steps.
- Hundreds of regions per region server. If you need more, fire up your preferred data generation tool and tell it to create a large enough table. Those regions can be empty, no need to fill them with actual data.
- At least ten times more region servers than available CPUs on the master. If the number of CPUs is too high, set hbase.master.procedure.threads in the safety valves to number-of-region-servers divided by 10. For example, if you have 24 cores and a 10 nodes cluster, set the configuration to 2 or 3.
Set your environment the following way:
- Access to cluster-wide shutdown/startup via whatever control plane you use
- Access to master-role specific restart via whatever control plane you use
- A shell on the master node that tails the master log (use tail -F with a capital F to ride over log rolls).
- A shell on the master node that’s ready to grab the HMaster’s pid and kill -9 it.
The procedure is:
- Start with a cluster as described above.
- Restart the entire HBase cluster.
- When the master log shows “Clean cluster startup. Assigning user regions” and it starts assigning regions, kill -9 the master.
- Stop the entire HBase cluster
- Restart the entire HBase cluster
- Once the master shows “Found regions out on cluster or in RIT”, kill -9 the master again.
- Restart only the Master.
- (Optionally repeat the above two steps)
You know that you hit the bug when the given indicators show up.
If it seems like it’s able to assign a lot of regions again, try kill -9 again and restart the Master role.
Identified and tested workaround to mitigate this problem involves configuration tunings to speed up regions assignment on Region Servers and also reduce the time spent by BulkAssignment threads on the Master side. We do not recommend setting this configurations for normal operations.
- hbase.master.procedure.threads - This property defines the general procedure pool size, and in the context of this issue, is the pool for executing SCPs. Increasing this pool would allow more SCPs for different Region Servers to run in parallel, allowing for more regions assignments to be processed. However, on the core of this problem is the fact that Master may have multiple SCPs for same Region Server. These are not run in parallel, therefore tuning this parameter will not be sufficient. We recommend setting this parameter to the number of Region Servers in the cluster, so that under normal scenarios where there is one SCP for each Region Server, all those can run in parallel.
- hbase.bulk.assignment.perregion.open.time - This property determines how long a bulk assignment thread on Master's BulkAssigner should wait for all its regions to get assigned. Setting to a value as low as 100 (milliseconds) will allow the no-op SCPs to complete faster. Which will open up execution spots for SCPs that can do actual assignment work.
- hbase.master.namespace.init.timeout - The master has a time limit for how long it takes to assign the namespace table. Given that we want to limit master restarts, this is better adjusted upwards.
Since this issue is especially pronounced on clusters with a large number of regions-per-region-server the following additional config can also help:
- hbase.regionserver.executor.openregion.threads - This relates to the number of threads on each Region Server responsible for handling the region assignment requests. Provided individual Region Servers are not overloaded already tuning this value higher than the default (3) should help expedite region assignment.
Thanks to the ton of folks who helped diagnose, chase down, and document this issue, its reproduction, and the workaround. Especially Jean-Daniel Cryans, Wellington Chevreuil, Ankit Singhal, AMIT VIRMANI, Shamik Dave, Esteban Gutierrez, and Josh Elser.