From 79ecc9695a9598aee26b8f16da856fb5822d299e Mon Sep 17 00:00:00 2001
From: Michael Stack
Date: Sun, 11 Mar 2018 13:15:16 -0700
Subject: [PATCH] HBASE-20173 [AMv2] DisableTableProcedure concurrent to ServerCrashProcedure can deadlock

Allow that DisableTableProcedure may grab a region lock before
ServerCrashProcedure does. Cater to this circumstance, in which SCP was
unable to make progress, by running the search for RIT against the crashed
server a second time, after all crashed-server assignments have been
created. The second run uncovers cases such as the above
DisableTableProcedure unassign and interrupts its suspend, allowing both
procedures to make progress.

M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto
 Add a new procedure step, run after the assigns, that reruns the RIT finder method.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
 Make this important log more specific about what is going on.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
 Better explanation of what is going on.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
 Add an extra step and run handleRIT a second time after we've queued up
 all SCP assigns. Also fix a bug: SCP was adding an assign for a RIT that
 was actually trying to unassign (this made the deadlock more likely).
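To make the ordering concrete, here is a toy Java model of the idea, not HBase code: the names ScpRitSketch, Transition, Step and handleRit are hypothetical stand-ins. It assumes a concurrent disable queues its unassign against the crashed server only after the first RIT pass has already run, so only the second pass finds and wakes it.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model (hypothetical names, not HBase classes) of why rerunning the
 * RIT search after the assigns lets a crash handler and a concurrent
 * unassign both make progress.
 */
public class ScpRitSketch {
  /** Hypothetical region transition suspended while waiting on a server. */
  static final class Transition {
    final String region;
    final String server;
    boolean suspended = true;

    Transition(String region, String server) {
      this.region = region;
      this.server = server;
    }
  }

  /** Hypothetical crash-handler steps; HANDLE_RIT_AGAIN is the added pass. */
  enum Step { HANDLE_RIT, ASSIGN, HANDLE_RIT_AGAIN, FINISH }

  final List<Transition> inFlight = new ArrayList<>();
  final List<String> log = new ArrayList<>();

  /** Wakes every suspended transition that targets the crashed server. */
  int handleRit(String crashedServer) {
    int woken = 0;
    for (Transition t : inFlight) {
      if (t.suspended && t.server.equals(crashedServer)) {
        t.suspended = false; // interrupt the suspend so the transition can finish
        woken++;
      }
    }
    return woken;
  }

  void runCrashHandler(String crashedServer, Runnable concurrentDisable) {
    for (Step step : Step.values()) {
      switch (step) {
        case HANDLE_RIT:
          log.add("first pass woke " + handleRit(crashedServer));
          break;
        case ASSIGN:
          // Assumed race: the disable grabs a region lock and queues its
          // unassign against the dead server after the first pass.
          concurrentDisable.run();
          log.add("queued assigns");
          break;
        case HANDLE_RIT_AGAIN:
          log.add("second pass woke " + handleRit(crashedServer));
          break;
        case FINISH:
          log.add("finished");
          break;
      }
    }
  }

  public static void main(String[] args) {
    ScpRitSketch s = new ScpRitSketch();
    s.runCrashHandler("rs1", () -> s.inFlight.add(new Transition("r1", "rs1")));
    System.out.println(s.log);
  }
}
```

Running main prints `[first pass woke 0, queued assigns, second pass woke 1, finished]`; without the HANDLE_RIT_AGAIN step, the late unassign would stay suspended.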
---
 .../hbase/IntegrationTestDDLMasterFailover.java    |  6 +--
 .../src/main/protobuf/MasterProcedure.proto        |  3 +-
 .../hbase/master/assignment/AssignmentManager.java |  2 +-
 .../hbase/master/assignment/UnassignProcedure.java | 13 +++---
 .../master/procedure/RecoverMetaProcedure.java     |  2 +
 .../master/procedure/ServerCrashProcedure.java     | 48 +++++++++++++++-------
 6 files changed, 49 insertions(+), 25 deletions(-)

diff --git a/hbase-it/src/test/java/org/apache/hadoop/hbase/IntegrationTestDDLMasterFailover.java b/hbase-it/src/test/java/org/apache/hadoop/hbase/IntegrationTestDDLMasterFailover.java
index d9a2f94cd4..4d0d7e0e97 100644
--- a/hbase-it/src/test/java/org/apache/hadoop/hbase/IntegrationTestDDLMasterFailover.java
+++ b/hbase-it/src/test/java/org/apache/hadoop/hbase/IntegrationTestDDLMasterFailover.java
@@ -53,12 +53,12 @@ import org.slf4j.LoggerFactory;
 /**
  *
- * Integration test that verifies Procedure V2.
+ * Integration test that verifies Procedure V2.
  *
  * DDL operations should go through (rollforward or rollback) when primary master is killed by
- * ChaosMonkey (default MASTER_KILLING)
+ * ChaosMonkey (default MASTER_KILLING).
  *
- * Multiple Worker threads are started to randomly do the following Actions in loops:
+ * Multiple Worker threads are started to randomly do the following Actions in loops:
  * Actions generating and populating tables:
  *