[HBASE-25353] [Flakey Tests] branch-2 TestShutdownBackupMaster - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 2.3.4, 2.5.0, 2.4.1
Component/s: flakies
Labels: None
Hadoop Flags: Reviewed

Description

Making this a sub-issue of the parent issue, which fails in a way similar to how we are failing now.

Currently, I see that the TestShutdownBackupMaster test usually passes, but it is warped in how it completes. It runs through all the retries just before the test times out at the 13-minute maximum; e.g. you'll see this:

2020-12-02 22:07:34,200 DEBUG [master/stack:0:becomeActiveMaster] client.ConnectionImplementation(1009): locateRegionInMeta parentTable='hbase:meta', attempt=44 of 46 failed; retrying after sleep of 46

... so we do all the retries and then complete, so the test looks like it 'succeeded', but it actually ran for Total time: 12:41 min, and the log is full of thread dumps because the cluster won't go down (the time is spent in the test shutdown).

Often though, we won't complete the retries in time and the test fails. It is in the flakey list.

Rather, we are supposed to fail out fast when we are shutting down. Below is the type of retry we see.

2020-12-02 10:53:35,540 INFO [Listener at localhost/61609] util.JVMClusterUtil(348): Shutdown of 2 master(s) and 2 regionserver(s) complete
 2020-12-02 10:53:35,548 DEBUG [master/stack:0:becomeActiveMaster] client.ConnectionImplementation(1009): locateRegionInMeta parentTable='hbase:meta', attempt=2 of 46 failed; retrying after sleep of 46
 org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x1afa7f5b closed
 at org.apache.hadoop.hbase.client.ConnectionImplementation.checkClosed(ConnectionImplementation.java:630)
 at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:815)
 at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:138)
 at org.apache.hadoop.hbase.client.ConnectionImplementation.relocateRegion(ConnectionImplementation.java:803)
 at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.relocateRegion(ConnectionUtils.java:138)
 at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:933)
 at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:823)
 at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:138)
 at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:64)
 at org.apache.hadoop.hbase.client.RegionLocator.getRegionLocation(RegionLocator.java:70)
 at org.apache.hadoop.hbase.client.RegionLocator.getRegionLocation(RegionLocator.java:59)
 at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:223)
 at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
 at org.apache.hadoop.hbase.client.HTable.get(HTable.java:383)
 at org.apache.hadoop.hbase.client.HTable.get(HTable.java:357)
 at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:141)
 at org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:278)
 at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:103)
 at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63)
 at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:249)
 at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1224)
 at org.apache.hadoop.hbase.master.TestShutdownBackupMaster$MockHMaster.initClusterSchemaService(TestShutdownBackupMaster.java:68)
 at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1021)
 at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2082)
 at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:506)

See how a master is trying to become active and won't relent, even though the cluster is shutting down? See how we retry even though the check for a closed connection is coming back with a DoNotRetryIOException? The exception is being swallowed, so we keep going.

Fix looks simple enough.
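
As an illustration of the fail-fast behavior described above, here is a minimal sketch of a retry loop that rethrows a do-not-retry exception instead of swallowing it. This is not the change made in the linked pull request. The class `FailFastRetry`, the method `callWithRetries`, and the local `DoNotRetryIOException` stand-in are hypothetical names, defined here only so the example compiles without HBase on the classpath.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class FailFastRetry {

  /** Hypothetical local stand-in for HBase's DoNotRetryIOException. */
  public static class DoNotRetryIOException extends IOException {
    public DoNotRetryIOException(String msg) {
      super(msg);
    }
  }

  /**
   * Runs op up to maxAttempts times, sleeping sleepMs between attempts.
   * A DoNotRetryIOException (for example, a closed connection during
   * shutdown) is rethrown immediately instead of being retried.
   */
  public static <T> T callWithRetries(Callable<T> op, int maxAttempts, long sleepMs)
      throws Exception {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (DoNotRetryIOException e) {
        // Fail fast: retrying cannot help, so do not swallow this.
        throw e;
      } catch (IOException e) {
        // Treated as transient in this sketch: remember it and retry.
        last = e;
        Thread.sleep(sleepMs);
      }
    }
    throw new IOException("Failed after " + maxAttempts + " attempts", last);
  }

  public static void main(String[] args) throws Exception {
    final int[] calls = {0};
    try {
      callWithRetries(() -> {
        calls[0]++;
        throw new DoNotRetryIOException("hconnection closed");
      }, 46, 46L);
    } catch (DoNotRetryIOException e) {
      System.out.println("gave up after " + calls[0] + " attempt(s): " + e.getMessage());
    }
  }
}
```

With the do-not-retry case handled first, the operation is attempted once rather than 46 times, so a master that is trying to become active during shutdown would stop promptly instead of running down the full retry budget.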

Issue Links

links to GitHub Pull Request #2733

People

Assignee: Michael Stack

Reporter: Michael Stack

Votes: 0

Watchers: 2

Dates

Created: 03/Dec/20 06:21

Updated: 14/Dec/20 18:38

Resolved: 05/Dec/20 22:26