Details
- Sub-task
- Status: Closed
- Major
- Resolution: Won't Fix
- 1.1.0
- None
- None
Description
In TestMetaWithReplicas, the mini cluster is started at the beginning and shut down at the end of every test in the class, which makes the test class take longer to complete. Instead, we can start and stop the mini cluster only once for the whole class.
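The difference between the two lifecycles can be illustrated with a small, self-contained simulation. This is only a sketch: FakeMiniCluster and both helper methods are hypothetical stand-ins, not HBase or JUnit classes. In JUnit 4 terms, the change corresponds to moving setup and teardown from @Before/@After methods to static @BeforeClass/@AfterClass methods.

```java
import java.util.Arrays;
import java.util.List;

public class ClusterLifecycleSketch {
    /** Hypothetical stand-in for a mini cluster; it only counts start/shutdown calls. */
    static final class FakeMiniCluster {
        int starts = 0;
        int stops = 0;
        void start() { starts++; }
        void shutdown() { stops++; }
    }

    /** Per-test lifecycle: the cluster is started and shut down around every test. */
    public static int startsWithPerTestLifecycle(List<Runnable> tests) {
        FakeMiniCluster cluster = new FakeMiniCluster();
        for (Runnable test : tests) {
            cluster.start();
            try {
                test.run();
            } finally {
                cluster.shutdown();
            }
        }
        return cluster.starts;
    }

    /** Per-class lifecycle: one start before the first test, one shutdown after the last. */
    public static int startsWithPerClassLifecycle(List<Runnable> tests) {
        FakeMiniCluster cluster = new FakeMiniCluster();
        cluster.start();
        try {
            for (Runnable test : tests) {
                test.run();
            }
        } finally {
            cluster.shutdown();
        }
        return cluster.starts;
    }

    public static void main(String[] args) {
        Runnable noop = () -> { };
        List<Runnable> tests = Arrays.asList(noop, noop, noop, noop);
        System.out.println("per-test starts:  " + startsWithPerTestLifecycle(tests));
        System.out.println("per-class starts: " + startsWithPerClassLifecycle(tests));
    }
}
```

The per-class variant saves one cluster start and stop for every test after the first, at the cost that all tests then share cluster state, so a test that changes the cluster can affect the tests that run after it.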
Attachments
- HBASE-13659.patch, 3 kB, Ashish Singhi
- HBASE-13659-branch-1.1.patch, 5 kB, Ashish Singhi
- HBASE-13659-branch-1.1-v1.patch, 4 kB, Ashish Singhi
- org.apache.hadoop.hbase.client.TestMetaWithReplicas-output.txt, 655 kB, Nick Dimiduk
Activity
In build #14007 the test took 1 min 50 secs to complete, whereas the patch build, #14008, took 1 min 3 secs.
Hi ashish singhi, I applied your patch to master and it works fine. I brought it back to branch-1 and I'm seeing it consistently hang.
From jstack
"main" prio=5 tid=0x00007fe8e980b800 nid=0x1903 waiting on condition [0x000000010cd5e000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
  at java.lang.Thread.sleep(Native Method)
  at org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:146)
  at org.apache.hadoop.hbase.MiniHBaseCluster.waitForActiveAndReadyMaster(MiniHBaseCluster.java:485)
  at org.apache.hadoop.hbase.HBaseCluster.waitForActiveAndReadyMaster(HBaseCluster.java:205)
  at org.apache.hadoop.hbase.client.TestMetaWithReplicas.shutdownMetaAndDoValidations(TestMetaWithReplicas.java:221)
  at org.apache.hadoop.hbase.client.TestMetaWithReplicas.testShutdownHandling(TestMetaWithReplicas.java:145)
From the test logs, I see
2015-06-17 10:53:09,602 WARN [main] regionserver.HRegionServer(2063): Unable to report fatal error to master
com.google.protobuf.ServiceException: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Call to /10.0.0.110:50399 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to /10.0.0.110:50399 is closing. Call id=47, waitTime=1
  at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:224)
  at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:288)
  at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.reportRSFatalError(RegionServerStatusProtos.java:9006)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2060)
  at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.abortRegionServer(MiniHBaseCluster.java:174)
  at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.access$200(MiniHBaseCluster.java:108)
  at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer$2.run(MiniHBaseCluster.java:167)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:356)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
  at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:306)
  at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.abort(MiniHBaseCluster.java:165)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2072)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.kill(HRegionServer.java:2087)
  at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.kill(MiniHBaseCluster.java:161)
  at org.apache.hadoop.hbase.MiniHBaseCluster.killRegionServer(MiniHBaseCluster.java:246)
  at org.apache.hadoop.hbase.client.TestMetaWithReplicas.shutdownMetaAndDoValidations(TestMetaWithReplicas.java:201)
  at org.apache.hadoop.hbase.client.TestMetaWithReplicas.testShutdownHandling(TestMetaWithReplicas.java:145)
Looks like it's getting stuck replaying logs to recover a killed RS, meanwhile master just hangs waiting for minimum number of RS's to rejoin cluster.
Attaching test run log.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12740165/org.apache.hadoop.hbase.client.TestMetaWithReplicas-output.txt
against master branch at commit 623fd63827b2953c150597f24c7205737119bebe.
ATTACHMENT ID: 12740165
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 179 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14451//console
This message is automatically generated.
Thanks ndimiduk for looking into this.
testShutdownHandling was failing for the reason you described. I then checked why only 2 RSs were online in the cluster and found that testShutdownOfReplicaHolder kills an RS but does not start it again. That leaves only 2 RSs online, while the master keeps waiting for the minimum of 3 RSs to come online.
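The hang can be modeled with a minimal sketch of a wait on a minimum server count. The method name, the polling bound, and the counter are assumptions made for illustration; this is not the master's actual wait loop.

```java
import java.util.function.IntSupplier;

public class MinServersWaitSketch {
    /**
     * Polls until at least minToStart servers are online or maxPolls is exhausted.
     * Returns true if the minimum was reached. Hypothetical helper for illustration only.
     */
    public static boolean waitForMinServers(IntSupplier onlineServers, int minToStart, int maxPolls) {
        for (int poll = 0; poll < maxPolls; poll++) {
            if (onlineServers.getAsInt() >= minToStart) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] online = {3};
        online[0]--; // an earlier test kills one region server and never restarts it

        // With a minimum of 3 and only 2 servers online, the wait can never succeed.
        System.out.println(waitForMinServers(() -> online[0], 3, 1000));
        // With a minimum of 2, the same cluster satisfies the wait immediately.
        System.out.println(waitForMinServers(() -> online[0], 2, 1000));
    }
}
```

In the sketch the wait is bounded by maxPolls so that it terminates. A wait with no such bound would spin indefinitely, which matches the symptom in the jstack output above.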
Attached patch for branch-1.1.
It does not appear to fail on the master branch, but it would be better to commit the same branch-1.1 patch there as well.
Please review.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12740315/HBASE-13659-branch-1.1.patch
against branch-1.1 branch at commit 41d9e8d9b4895d0711f006d926a39e5ae3bd7c9d.
ATTACHMENT ID: 12740315
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
+1 hadoop versions. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 protoc. The applied patch does not increase the total number of protoc compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 checkstyle. The applied patch does not increase the total number of checkstyle errors
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 lineLengths. The patch does not introduce lines longer than 100
+1 site. The mvn post-site goal succeeds with this patch.
+1 core tests. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14452//testReport/
Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14452//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14452//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14452//console
This message is automatically generated.
I have attached another patch, v1, for branch-1.1. Instead of starting an RS again at the end of testShutdownOfReplicaHolder, it sets the conf hbase.master.wait.on.regionservers.mintostart to 2. With this, all the tests in this class pass 5/5 times. With the earlier patch for branch-1.1, some tests were flaky:
java.lang.AssertionError: null
  at org.apache.hadoop.hbase.client.TestMetaWithReplicas.testMetaAddressChange(TestMetaWithReplicas.java:368)

testHBaseFsckWithMetaReplicas(org.apache.hadoop.hbase.client.TestMetaWithReplicas) Time elapsed: 0.234 sec <<< FAILURE!
java.lang.AssertionError: expected:<[]> but was:<[MULTI_META_REGION, UNKNOWN]>
  at org.junit.Assert.fail(Assert.java:88)
  at org.junit.Assert.failNotEquals(Assert.java:743)
  at org.junit.Assert.assertEquals(Assert.java:118)
  at org.junit.Assert.assertEquals(Assert.java:144)
  at org.apache.hadoop.hbase.util.hbck.HbckTestingUtil.assertNoErrors(HbckTestingUtil.java:91)
  at org.apache.hadoop.hbase.client.TestMetaWithReplicas.testHBaseFsckWithMetaReplicas(TestMetaWithReplicas.java:279)

testHBaseFsckWithExcessMetaReplicas(org.apache.hadoop.hbase.client.TestMetaWithReplicas) Time elapsed: 1.29 sec <<< FAILURE!
java.lang.AssertionError: expected:<[UNKNOWN, SHOULD_NOT_BE_DEPLOYED]> but was:<[UNKNOWN, SHOULD_NOT_BE_DEPLOYED, MULTI_META_REGION]>
  at org.junit.Assert.fail(Assert.java:88)
  at org.junit.Assert.failNotEquals(Assert.java:743)
  at org.junit.Assert.assertEquals(Assert.java:118)
  at org.junit.Assert.assertEquals(Assert.java:144)
  at org.apache.hadoop.hbase.util.hbck.HbckTestingUtil.assertErrors(HbckTestingUtil.java:99)
  at org.apache.hadoop.hbase.client.TestMetaWithReplicas.testHBaseFsckWithExcessMetaReplicas(TestMetaWithReplicas.java:412)

testHBaseFsckWithFewerMetaReplicas(org.apache.hadoop.hbase.client.TestMetaWithReplicas) Time elapsed: 1.265 sec <<< FAILURE!
java.lang.AssertionError: expected:<[UNKNOWN, NO_META_REGION]> but was:<[UNKNOWN, NO_META_REGION, SHOULD_NOT_BE_DEPLOYED, MULTI_META_REGION]>
  at org.junit.Assert.fail(Assert.java:88)
  at org.junit.Assert.failNotEquals(Assert.java:743)
  at org.junit.Assert.assertEquals(Assert.java:118)
  at org.junit.Assert.assertEquals(Assert.java:144)
  at org.apache.hadoop.hbase.util.hbck.HbckTestingUtil.assertErrors(HbckTestingUtil.java:99)
  at org.apache.hadoop.hbase.client.TestMetaWithReplicas.testHBaseFsckWithFewerMetaReplicas(TestMetaWithReplicas.java:292)

testMetaLookupThreadPoolCreated(org.apache.hadoop.hbase.client.TestMetaWithReplicas) Time elapsed: 1.301 sec <<< ERROR!
org.apache.hadoop.hbase.TableNotFoundException: Table 'testMetaLookupThreadPoolCreated' was not found, got: hbase:namespace.
  at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1274)
  at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1155)
  at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1126)
  at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1110)
  at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1132)
  at org.apache.hadoop.hbase.client.TestMetaWithReplicas.testMetaLookupThreadPoolCreated(TestMetaWithReplicas.java:234)
I could not find a fix for this, as I am not very familiar with the read replica feature.
devaraj, enis, ndimiduk... can you help with this?
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12740364/HBASE-13659-branch-1.1-v1.patch
against branch-1.1 branch at commit 51b606cd185437802f0a7a4620f1434e8e2d9c74.
ATTACHMENT ID: 12740364
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
+1 hadoop versions. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 protoc. The applied patch does not increase the total number of protoc compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 checkstyle. The applied patch does not increase the total number of checkstyle errors
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 lineLengths. The patch does not introduce lines longer than 100
-1 site. The patch appears to cause mvn post-site goal to fail.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.mapreduce.TestImportExport
org.apache.hadoop.hbase.util.TestProcessBasedCluster
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14460//testReport/
Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14460//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14460//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14460//console
This message is automatically generated.
There is still a flaky test:
Flaked tests:
org.apache.hadoop.hbase.client.TestMetaWithReplicas.testMetaAddressChange(org.apache.hadoop.hbase.client.TestMetaWithReplicas)
  Run 1: TestMetaWithReplicas.testMetaAddressChange:370 null
  Run 2: PASS
ashish singhi I think your patch makes things better than they were, certainly the conf.setInt(ServerManager.WAIT_ON_REGIONSERVERS_MINTOSTART, 2); bit.
I'm seeing consistent failure at line 370 as well. Adding some extra logging,
2015-07-01 17:36:34,023 INFO [main] client.TestMetaWithReplicas(350): testMetaAddressChange -- i think meta is on 10.0.0.110,59059,1435797366643
...
2015-07-01 17:36:35,686 INFO [main] client.TestMetaWithReplicas(367): testMetaAddressChange -- sending move request of 1588230740 to 10.0.0.110,58926,1435797319567
2015-07-01 17:36:35,687 DEBUG [B.defaultRpcServer.handler=0,queue=0,port=59142] master.HMaster(1402): Skipping move of region hbase:meta,,1.1588230740 because region already assigned to the same server 10.0.0.110,58926,1435797319567.
In between here and there there's no mention of 1588230740. This is failing consistently for me locally.
Let me take a look at it later tonight. I think when I wrote the tests I deliberately did it this way - start/shutdown cluster for each test, maybe because I was mucking around with ZK. Not sure why..
adding
RegionLocator metaLoc = TEST_UTIL.getConnection().getRegionLocator(TableName.META_TABLE_NAME);
LOG.info("testMetaAddressChange -- metaLocator says "
    + metaLoc.getRegionLocation(null).getServerName().getServerName());
It now seems the way the test parses meta location and the location returned by RegionLocator instance disagree. Maybe RegionLocator is not looking specifically for the first replica?
2015-07-01 17:57:53,135 INFO [main] client.TestMetaWithReplicas(340): testMetaAddressChange -- starting test.
2015-07-01 17:57:53,136 INFO [main] client.TestMetaWithReplicas(350): testMetaAddressChange -- parsed meta location is 10.0.0.110,59702,1435798645712
2015-07-01 17:57:53,136 INFO [main] client.TestMetaWithReplicas(352): testMetaAddressChange -- metaLocator says 10.0.0.110,59570,1435798598482
Test looks for meta with
ZooKeeperWatcher zkw = TEST_UTIL.getZooKeeperWatcher();
String baseZNode = conf.get(HConstants.ZOOKEEPER_ZNODE_PARENT,
    HConstants.DEFAULT_ZOOKEEPER_ZNODE_PARENT);
String primaryMetaZnode = ZKUtil.joinZNode(baseZNode,
    conf.get("zookeeper.znode.metaserver", "meta-region-server"));
while ZooKeeperWatcher appears to use
str = ZKUtil.joinZNode(baseZNode, conf.get("zookeeper.znode.metaserver", "meta-region-server") + "-" + replicaId);
Looking at logic in MetaTableLocator, it seems to specify a default replicaId of 1, which means it'll always be going to a "-" + replicaId location instead of the bare location used in the test.
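To make the two path shapes concrete, here is a small self-contained sketch that builds both strings. The joinZNode helper is a simplified stand-in for ZKUtil.joinZNode, and the "/hbase" parent znode is an assumed default. The sketch only compares the two expressions quoted above; it makes no claim about which replicaId the locator actually uses.

```java
public class MetaZnodePathSketch {
    // Assumed defaults for illustration; a real cluster reads these from its configuration.
    static final String BASE_ZNODE = "/hbase";
    static final String META_ZNODE_NAME = "meta-region-server";

    /** Simplified stand-in for ZKUtil.joinZNode: joins parent and child with a slash. */
    static String joinZNode(String prefix, String suffix) {
        return prefix + "/" + suffix;
    }

    /** The bare path the test reads, with no replica suffix. */
    public static String bareMetaZnode() {
        return joinZNode(BASE_ZNODE, META_ZNODE_NAME);
    }

    /** The suffixed path shape, with "-" + replicaId appended to the znode name. */
    public static String replicaMetaZnode(int replicaId) {
        return joinZNode(BASE_ZNODE, META_ZNODE_NAME + "-" + replicaId);
    }

    public static void main(String[] args) {
        System.out.println(bareMetaZnode());
        System.out.println(replicaMetaZnode(1));
    }
}
```

Any lookup that resolves to the suffixed path reads a different znode from the one the test parses, which would be consistent with the two locations in the log disagreeing.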
I've run this test without the patch a bunch of times locally and it's passing consistently. I guess the test depends on resetting the cluster each round.
"I guess the test depends on resetting the cluster each round"
I think so too but it has been a while since I wrote those tests..
I have closed this as Won't Fix; if it should be something else, just let me know.
Thanks.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12731937/HBASE-13659.patch
against master branch at commit 9aeafe30b7d932e562f803fd071812cd27aebaf8.
ATTACHMENT ID: 12731937
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
+1 hadoop versions. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 protoc. The applied patch does not increase the total number of protoc compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 checkstyle. The applied patch does not increase the total number of checkstyle errors
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 lineLengths. The patch does not introduce lines longer than 100
+1 site. The mvn site goal succeeds with this patch.
+1 core tests. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14008//testReport/
Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14008//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14008//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14008//console
This message is automatically generated.