Details
Description
On a test cluster, the following events happened with ITBLL and CM, leaving meta unavailable until the master was restarted.
An RS carrying meta died, and master assigned the region to one of the RSs.
2013-10-03 23:30:06,611 INFO [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
2013-10-03 23:30:06,611 INFO [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] master.RegionStates: Transitioned {1588230740 state=OFFLINE, ts=1380843006601, server=null} to {1588230740 state=PENDING_OPEN, ts=1380843006611, server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
2013-10-03 23:30:06,611 DEBUG [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] master.ServerManager: New admin connection to gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
At the same time, the RS that meta had just been assigned to also died (due to CM), and restarted:
2013-10-03 23:30:07,636 DEBUG [RpcServer.handler=17,port=60000] master.ServerManager: REPORT: Server gs-hdp2-secure-1380781860-hbase-8.cs1cloud.internal,60020,1380843002494 came back up, removed it from the dead servers list
2013-10-03 23:30:08,769 INFO [RpcServer.handler=18,port=60000] master.ServerManager: Triggering server recovery; existingServer gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 looks stale, new server:gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362
2013-10-03 23:30:08,771 DEBUG [RpcServer.handler=18,port=60000] master.AssignmentManager: Checking region=hbase:meta,,1.1588230740, zk server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 current=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820, matches=true
2013-10-03 23:30:08,771 DEBUG [RpcServer.handler=18,port=60000] master.ServerManager: Added=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 to dead servers, submitted shutdown handler to be executed meta=true
2013-10-03 23:30:08,771 INFO [RpcServer.handler=18,port=60000] master.ServerManager: Registering server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362
2013-10-03 23:30:08,772 INFO [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] handler.MetaServerShutdownHandler: Splitting hbase:meta logs for gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
AM/SSH sees that the RS that died was carrying meta, but the assignment RPC request had still not been sent:
2013-10-03 23:30:08,791 DEBUG [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] master.AssignmentManager: Checking region=hbase:meta,,1.1588230740, zk server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 current=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820, matches=true
2013-10-03 23:30:08,791 INFO [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] handler.MetaServerShutdownHandler: Server gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 was carrying META. Trying to assign.
2013-10-03 23:30:08,791 DEBUG [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] master.RegionStates: Offline 1588230740 with current state=PENDING_OPEN, expected state=OFFLINE/SPLITTING/MERGING
2013-10-03 23:30:08,791 INFO [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] master.RegionStates: Transitioned {1588230740 state=PENDING_OPEN, ts=1380843006611, server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820} to {1588230740 state=OFFLINE, ts=1380843008791, server=null}
2013-10-03 23:30:09,809 INFO [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] zookeeper.ZooKeeperNodeTracker: Unsetting hbase:meta region location in ZooKeeper
Our first attempt at the assign rpc fails, because the new server is now starting. The second attempt though succeeds:
2013-10-03 23:30:10,621 INFO [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
2013-10-03 23:30:10,621 INFO [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] master.RegionStates: Transitioned {1588230740 state=OFFLINE, ts=1380843008791, server=null} to {1588230740 state=PENDING_OPEN, ts=1380843010621, server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
2013-10-03 23:30:10,621 DEBUG [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] master.ServerManager: New admin connection to gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
2013-10-03 23:30:10,622 DEBUG [RpcServer.handler=22,port=60000] master.ServerManager: REPORT: Server gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 came back up, removed it from the dead servers list
2013-10-03 23:30:10,934 DEBUG [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] master.AssignmentManager: Skip assigning {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}, it is already {1588230740 state=PENDING_OPEN, ts=1380843010621, server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
Note that the start time for the server does not match (1380842900820 is old, 1380843006362 is new).
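To make the mismatch concrete, here is a minimal Java sketch that parses the host,port,startcode strings from the logs above and compares them. ServerIdentity and its methods are hypothetical names used for illustration only; this is not the real HBase ServerName class.

```java
// Hypothetical stand-in for a server identity; NOT the real HBase ServerName class.
final class ServerIdentity {
    final String host;
    final int port;
    final long startCode;

    ServerIdentity(String host, int port, long startCode) {
        this.host = host;
        this.port = port;
        this.startCode = startCode;
    }

    // Parses the "host,port,startcode" form that appears in the master logs.
    static ServerIdentity parse(String text) {
        String[] parts = text.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected host,port,startcode but got: " + text);
        }
        return new ServerIdentity(parts[0], Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
    }

    // True when both identities point at the same network endpoint.
    boolean sameHostAndPort(ServerIdentity other) {
        return host.equals(other.host) && port == other.port;
    }

    // True only when the endpoint AND the process start time agree.
    boolean sameInstance(ServerIdentity other) {
        return sameHostAndPort(other) && startCode == other.startCode;
    }
}
```

With the two names from the logs, sameHostAndPort is true while sameInstance is false, which is the situation the master ran into: the RPC reaches the endpoint, but the process answering is not the one the master recorded.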
The region server got the rpc to open the region, but failed to change the zk state, because the ServerNames do not match:
2013-10-03 23:30:10,601 INFO [Priority.RpcServer.handler=0,port=60020] regionserver.HRegionServer: Open hbase:meta,,1.1588230740
2013-10-03 23:30:10,897 DEBUG [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: regionserver:60020-0x1417d489d9b0bd6 Transitioning 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
2013-10-03 23:30:10,918 WARN [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: regionserver:60020-0x1417d489d9b0bd6 Attempt to transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to transition was gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 not the expected gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
2013-10-03 23:30:10,918 WARN [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=1588230740
2013-10-03 23:30:10,919 WARN [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] handler.OpenRegionHandler: Region was hijacked? Opening cancelled for encodedName=1588230740
2013-10-03 23:30:10,919 INFO [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] handler.OpenRegionHandler: Opening of region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting version 0
2013-10-03 23:30:10,919 DEBUG [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: regionserver:60020-0x1417d489d9b0bd6 Transitioning 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_FAILED_OPEN
2013-10-03 23:30:10,921 WARN [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: regionserver:60020-0x1417d489d9b0bd6 Attempt to transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to transition was gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 not the expected gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
2013-10-03 23:30:10,921 WARN [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] handler.OpenRegionHandler: Unable to mark region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} as FAILED_OPEN. It's likely that the master already timed out this open attempt, and thus another RS already has the region.
It seems that the RS behaved correctly: it could not transition the zk assignment node, so it did not open the region. However, the master fails to time out the assignment even though the meta region is reported in RIT:
2013-10-04 00:14:50,658 DEBUG [gs-hdp2-secure-1380781860-hbase-12.cs1cloud.internal,60000,1380842679864-BalancerChore] master.HMaster: Not running balancer because 1 region(s) in transition: {1588230740={1588230740 state=PENDING_OPEN, ts=1380843010621, server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}}
Attachments
- hbase-9721_v5-0.96.patch (61 kB, Enis Soztutar)
- hbase-9721_v5-0.98.patch (63 kB, Enis Soztutar)
- hbase-9721_v5.patch (67 kB, Enis Soztutar)
- hbase-9721_v4.patch (67 kB, Enis Soztutar)
- hbase-9721_v3.patch (66 kB, Enis Soztutar)
- hbase-9721_v2.patch (79 kB, Enis Soztutar)
- hbase-9721_v1.patch (75 kB, Enis Soztutar)
- hbase-9721_v0.patch (54 kB, Enis Soztutar)
Issue Links
- is related to
  - HBASE-10210 during master startup, RS can be you-are-dead-ed by master in error (Closed)
  - HBASE-10341 TestAssignmentManagerOnCluster fails occasionally (Closed)
- relates to
  - HBASE-8545 Meta stuck in transition when it is assigned to a just restarted dead region sever (Closed)
Activity
Nothing should rely on timeout as part of its logic, right? It was only there to catch bugs in 94
I agree with sershe. That detail provides a little extra state necessary for maintaining consistency of the assignment.
Here is a patch which adds ServerName to open and close region RPCs. The region server rejects the request if it is not the intended server to receive the request.
Will try to come up with a unit test tomorrow.
Attaching a more complete patch. As described, this adds serverName to the openRegion and closeRegion RPCs. The unit tests ensure that the AM fails over.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12619483/hbase-9721_v1.patch
against trunk revision .
ATTACHMENT ID: 12619483
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 15 new or modified tests.
+1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.
+1 hadoop1.1. The patch compiles against the hadoop 1.1 profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 lineLengths. The patch does not introduce lines longer than 100
-1 site. The patch appears to cause mvn site goal to fail.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.util.TestHBaseFsck
org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster
org.apache.hadoop.hbase.regionserver.TestRegionServerNoMaster
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/8225//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/8225//console
This message is automatically generated.
Here is a patch which should fix the unit tests.
Run TestAMOnCluster 20 times.
This is closely related to HBASE-8545. jxiang it would be good if you take a look.
Sure I will take a look. As to openRegion/closeRegion, currently, it is using the data in ZK to confirm the right region server is doing it. Is that not good enough? Do we open the region without going through ZK?
it is using the data in ZK to confirm the right region server is doing it. Is that not good enough? Do we open the region without going through ZK?
What happens is this: the znode is created by the master, but using the ServerName with an earlier timestamp. If the RS accepts the RPC, it can open the region, but cannot change the znode later (to OPENED or FAILED_OPEN) because its ServerName does not match the ServerName on the znode. The RS cannot do anything because it cannot tell the master about this.
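A rough sketch of the check that blocks the RS, assuming the unassigned znode stores the ServerName the master expects and refuses transitions from any other caller. UnassignedNodeSketch is a hypothetical in-memory model written for this thread, not the ZKAssign code; the state names are the ones that appear in the logs in the description.

```java
// Hypothetical in-memory stand-in for the unassigned znode; NOT the ZKAssign implementation.
final class UnassignedNodeSketch {
    enum State { M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENING, RS_ZK_REGION_FAILED_OPEN }

    private State state = State.M_ZK_REGION_OFFLINE;
    private final String expectedServer; // ServerName the master wrote when it created the node

    UnassignedNodeSketch(String expectedServer) {
        this.expectedServer = expectedServer;
    }

    // Moves the node from 'from' to 'to' only if the caller is the expected server
    // and the node is currently in 'from'. Returns false and changes nothing otherwise.
    boolean transition(String caller, State from, State to) {
        if (!expectedServer.equals(caller) || state != from) {
            return false;
        }
        state = to;
        return true;
    }

    State state() {
        return state;
    }
}
```

In this model a restarted RS with a newer start code fails both the OFFLINE to OPENING and the OFFLINE to FAILED_OPEN transitions, so the node stays in M_ZK_REGION_OFFLINE and nothing tells the master that the open did not happen.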
After the RS accepts the RPC, should the RS check both the znode version and data before opening the region? If so, we can avoid the problem without changing the RPC, right?
I ran the test 50 times on my local box and got no issue. Do you run the test with the latest code?
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12619723/hbase-9721_v2.patch
against trunk revision .
ATTACHMENT ID: 12619723
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 21 new or modified tests.
+1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.
+1 hadoop1.1. The patch compiles against the hadoop 1.1 profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 lineLengths. The patch does not introduce lines longer than 100
-1 site. The patch appears to cause mvn site goal to fail.
+1 core tests. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/8237//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8237//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8237//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8237//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8237//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8237//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8237//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8237//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8237//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8237//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/8237//console
This message is automatically generated.
should the RS check both the znode version and data before opening the region?
I think I prefer to put the sn (or just the startcode?) in the RPC as this patch does since we may do assignment without ZK later on.
I think I prefer to put the sn (or just the startcode?) in the RPC as this patch does since we may do assignment without ZK later on.
Agreed.
I ran the test 50 times on my local box and got no issue. Do you run the test with the latest code?
Yep, all the tests pass for me now.
Should we commit it?
I was wondering should we pass startcode only, or the full ServerName, since the hostname and the port are redundant. Other than that, I am fine.
I was wondering should we pass startcode only, or the full ServerName, since the hostname and the port are redundant
Sorry, forgot to reply to that. I think full ServerName is better in case it changes in the future or a new field is added etc. Plus it is not that much more data.
Using ServerName is a good choice from the point of view of a clean, extensible interface. However, from a performance point of view, I think it is better to use just the startcode. Although it is not much more data, it is pure overhead. It could have some impact if we are going to support lots of regions (and redundant regions for read availability). As I mentioned in HBASE-10210, I think we should make sure the startcode is the real and only differentiator for region server instances running on the same host, port pair. Therefore, if ServerName gets new fields later on, those new fields should not be used to differentiate two region server instances running on the same host, port pair.
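For a rough sense of the overhead being discussed, the sketch below estimates payload bytes under stated assumptions: protobuf-style base-128 varints for the numbers, raw UTF-8 for the hostname, and field tags and length prefixes ignored. PayloadEstimate is a hypothetical helper, not part of HBase, and the result is only an order-of-magnitude comparison.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for a back-of-the-envelope size comparison; not HBase code.
final class PayloadEstimate {
    // Number of bytes a base-128 varint needs for an unsigned value (7 payload bits per byte).
    static int varintSize(long value) {
        int bytes = 1;
        while ((value & ~0x7FL) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    // Hostname bytes plus varint-encoded port and start code (tags and length prefix ignored).
    static int fullServerNameSize(String host, int port, long startCode) {
        return host.getBytes(StandardCharsets.UTF_8).length
            + varintSize(port)
            + varintSize(startCode);
    }

    // Only the varint-encoded start code.
    static int startCodeOnlySize(long startCode) {
        return varintSize(startCode);
    }
}
```

For the server names in the description, the hostname dominates the full ServerName estimate, while the start code alone fits in a handful of bytes. Whether that difference matters in practice depends on how many open and close requests are in flight.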
Another thing: for closeRegion, maybe we don't need to change anything, right? If it is a different instance running on the same host, port pair, the region must not be served there. AM can handle such a NotServingRegionException properly.
It seems that the RS behaved correctly: it could not transition the zk assignment node, so it did not open the region. However, the master fails to time out the assignment even though the meta region is reported in RIT.
In trunk, the timeout logic is off by default. This situation should be fixed by the meta SSH. Do you run the latest code in trunk/0.96? With this patch, any affected region should be assigned faster than before.
andrew.purtell@gmail.com can we include this in 0.98 as well? The final patch (v3) should be ready to go if jxiang gives the go.
Sorry, this fell off my radar.
Attaching v3 with RPC now only carrying serverStartCode instead of full serverName.
Jimmy, I think we should still keep the check in the closeRegion() call. Although unlikely, if the intended server is the previous server, and the new server has the same region, the RS will still close the region unknowingly. This check ensures that no such race condition can happen.
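The start code check on both calls can be sketched as follows. This is a minimal in-memory model under stated assumptions, not the patch itself: GuardedRegionServer and its methods are hypothetical names, IllegalStateException stands in for whatever exception the real RPC layer would raise, and only the leading part of the message is taken from the test assertion quoted in the QA output on this issue.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory model of a region server; NOT HRegionServer.
final class GuardedRegionServer {
    private final long startCode;                       // identifies this process instance
    private final Set<String> online = new HashSet<>(); // encoded names of regions open here

    GuardedRegionServer(long startCode) {
        this.startCode = startCode;
    }

    // Rejects any request addressed to another instance on the same host, port pair.
    private void checkIntended(long intendedStartCode) {
        if (intendedStartCode != startCode) {
            throw new IllegalStateException(
                "This RPC was intended for a different server with startCode: "
                    + intendedStartCode + ", this server is: " + startCode);
        }
    }

    void openRegion(long intendedStartCode, String encodedName) {
        checkIntended(intendedStartCode);
        online.add(encodedName);
    }

    // Returns true if the region was online here and has now been closed.
    boolean closeRegion(long intendedStartCode, String encodedName) {
        checkIntended(intendedStartCode);
        return online.remove(encodedName);
    }

    boolean isOnline(String encodedName) {
        return online.contains(encodedName);
    }
}
```

In this model a close request carrying the previous instance's start code is rejected even when the new instance happens to hold the same region, which is the race the comment above describes. An open request with a stale start code is rejected before any region or znode state is touched.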
+1 for 0.98 if you guys agree the patch is in a committable state.
Looks good. Just one thing, for closeRegion, in AM#unassign (line around 3504), we don't handle DoNotRetryIOException. So it will retry and move region to fail_to_close state. What should we do here if we closeRegion?
What should we do here if we closeRegion?
I guess the AM will fail to close the region, but meanwhile the race condition will be gone, and next try we will pick up the correct server, no?
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12622727/hbase-9721_v3.patch
against trunk revision .
ATTACHMENT ID: 12622727
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 21 new or modified tests.
+1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.
+1 hadoop1.1. The patch compiles against the hadoop 1.1 profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 lineLengths. The patch introduces the following lines longer than 100:
+ final ServerName server, final byte[] regionName, final boolean transitionInZK) throws IOException {
+ new java.lang.String[]
);
+ TEST_UTIL.getMiniHBaseCluster().startMaster(); //restart the master so that conf take into affect
+ AdminProtos.OpenRegionRequest orr = RequestConverter.buildOpenRegionRequest(getRS().getServerName(), hri, 0, null);
+ AdminProtos.OpenRegionRequest orr = RequestConverter.buildOpenRegionRequest(getRS().getServerName(), hri, 0, null);
+ AdminProtos.OpenRegionRequest orr = RequestConverter.buildOpenRegionRequest(getRS().getServerName(), hri, 0, null);
+ RequestConverter.buildCloseRegionRequest(getRS().getServerName(), regionName, 0, null, true);
+ CloseRegionRequest request = RequestConverter.buildCloseRegionRequest(earlierServerName, regionName, true);
+ Assert.assertTrue(se.getCause().getMessage().contains("This RPC was intended for a different server"));
+ AdminProtos.OpenRegionRequest orr = RequestConverter.buildOpenRegionRequest(earlierServerName, hri, 0, null);
-1 site. The patch appears to cause mvn site goal to fail.
+1 core tests. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/8408//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8408//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8408//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8408//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8408//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8408//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8408//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8408//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8408//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8408//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/8408//console
This message is automatically generated.
+1. I agree it is a good thing to be sure we are talking to the right region server here.
SUCCESS: Integrated in HBase-0.98 #76 (See https://builds.apache.org/job/HBase-0.98/76/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server (enis: rev 1557917)
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/branches/0.98/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/branches/0.98/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
FAILURE: Integrated in HBase-TRUNK #4813 (See https://builds.apache.org/job/HBase-TRUNK/4813/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server (enis: rev 1557914)
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/trunk/hbase-protocol/src/main/java/com/google/protobuf/ZeroCopyLiteralByteString.java
- /hbase/trunk/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/trunk/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #71 (See https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/71/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server (enis: rev 1557917)
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/branches/0.98/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/branches/0.98/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
SUCCESS: Integrated in HBase-TRUNK-on-Hadoop-1.1 #53 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-1.1/53/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server (enis: rev 1557914)
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/trunk/hbase-protocol/src/main/java/com/google/protobuf/ZeroCopyLiteralByteString.java
- /hbase/trunk/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/trunk/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
This has caused persistent intermittent failures of TestAssignmentManagerOnCluster on 0.98 branch. See HBASE-10341. We can address the test failure on HBASE-10341 or I can revert this change on 0.98 branch.
TestAssignmentManagerOnCluster is also failing the same way on trunk so I am in favor of reverting this patch everywhere.
Log of a representative failed test run attached to HBASE-10341
Ok, let's do the revert. I'll inspect the test cases for finding out why they started failing sporadically.
SUCCESS: Integrated in HBase-0.98 #87 (See https://builds.apache.org/job/HBase-0.98/87/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server – REVERT (enis: rev 1558937)
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/branches/0.98/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/branches/0.98/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
FAILURE: Integrated in HBase-TRUNK #4827 (See https://builds.apache.org/job/HBase-TRUNK/4827/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server – REVERT (enis: rev 1558935)
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/trunk/hbase-protocol/src/main/java/com/google/protobuf/ZeroCopyLiteralByteString.java
- /hbase/trunk/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/trunk/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
SUCCESS: Integrated in HBase-0.98-on-Hadoop-1.1 #79 (See https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/79/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server – REVERT (enis: rev 1558937)
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/branches/0.98/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/branches/0.98/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
SUCCESS: Integrated in HBase-TRUNK-on-Hadoop-1.1 #56 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-1.1/56/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server – REVERT (enis: rev 1558935)
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/trunk/hbase-protocol/src/main/java/com/google/protobuf/ZeroCopyLiteralByteString.java
- /hbase/trunk/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/trunk/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
I would like to recommit this. Thanks to the logs Andrew attached at HBASE-10341, I was able to find the problem which caused flakiness in TestAssignmentManagerOnCluster.
Put simply, the configuration change that sets the max attempts to 20 at the start of the test did not take effect, because it was applied to a different configuration object. So the max attempts of 3 is used in the test testAssignRegionOnRestartedServer(). The test fails occasionally if the same dead server, out of 5 servers, is selected as the assignment target in all three attempts.
In patch v4, I corrected the conf usage and set the attempts to 40, so that the test can only fail with a probability of 1 / pow(5, 40).
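The odds quoted above can be checked with a few lines of Java. This is only a sketch: it assumes each attempt picks one of the servers uniformly and independently, which is a simplification of what the assignment code actually does, and the class and method names are made up for illustration.

```java
public class FlakyAssignmentOdds {

    /**
     * Probability that every one of `attempts` picks lands on the single
     * dead server, assuming uniform and independent picks (an assumption).
     */
    static double allAttemptsHitDeadServer(int servers, int attempts) {
        return Math.pow(1.0 / servers, attempts);
    }

    public static void main(String[] args) {
        // Before the fix: 3 attempts across 5 servers.
        System.out.printf("3 attempts:  %.3e%n", allAttemptsHitDeadServer(5, 3));
        // After patch v4: 40 attempts across 5 servers.
        System.out.printf("40 attempts: %.3e%n", allAttemptsHitDeadServer(5, 40));
    }
}
```

Under that assumption, three attempts fail together in roughly 0.8% of runs, which is enough to show up as an intermittent failure on a busy build server, while 40 attempts bring the figure down to the order of 1e-28.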
Previous:
2014-01-16 03:10:37,648 WARN [Thread-442] master.AssignmentManager(2002): Failed assignment of testAssignRegionOnRestartedServer,A,1389841837630.f6b232d66778f376e07a932d117c1b7b. to ip-10-234-61-203.us-west-2.compute.internal,47721,1389841830160, trying to assign elsewhere instead; try=3 of 3
Now:
2014-03-14 16:05:08,586 WARN [Thread-394] master.AssignmentManager(2012): Failed assignment of testAssignRegionOnRestartedServer,A,1394838308577.761b7b9c943ea9f71340a2bea385c53b. to 10.11.3.73,53463,1394838303235, trying to assign elsewhere instead; try=1 of 40 org.apache.hadoop.hbase.DoNotRetryIOException: org.apache.hadoop.hbase.DoNotRetryIOException: This RPC was intended for a different server with startCode: 1394838303235, this server is: 10.11.3.73,53463,1394838303335
Let me run QA and run the test 200 times locally. I will commit if they pass.
Andrew, are you still OK with this going into 0.98.2?
Sure
Let me run QA and run the test 200 times locally. I will commit if they pass.
Please commit sooner than that, I've started running the unit tests pre-spin. Sooner rather than later would be great. Maybe 50 or 100 times?
Actually we may both have said 0.98.2 while meaning 0.98.1. At least I was really referring to 0.98.1
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12634862/hbase-9721_v4.patch
against trunk revision .
ATTACHMENT ID: 12634862
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 21 new or modified tests.
+1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.
+1 hadoop1.1. The patch compiles against the hadoop 1.1 profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 lineLengths. The patch introduces the following lines longer than 100:
+ final ServerName server, final byte[] regionName, final boolean transitionInZK) throws IOException {
+ new java.lang.String[]
);
+ TEST_UTIL.getMiniHBaseCluster().startMaster(); //restart the master so that conf take into affect
+ AdminProtos.OpenRegionRequest orr = RequestConverter.buildOpenRegionRequest(getRS().getServerName(), hri, 0, null);
+ AdminProtos.OpenRegionRequest orr = RequestConverter.buildOpenRegionRequest(getRS().getServerName(), hri, 0, null);
+ AdminProtos.OpenRegionRequest orr = RequestConverter.buildOpenRegionRequest(getRS().getServerName(), hri, 0, null);
+ RequestConverter.buildCloseRegionRequest(getRS().getServerName(), regionName, 0, null, true);
+ CloseRegionRequest request = RequestConverter.buildCloseRegionRequest(earlierServerName, regionName, true);
+ Assert.assertTrue(se.getCause().getMessage().contains("This RPC was intended for a different server"));
+ AdminProtos.OpenRegionRequest orr = RequestConverter.buildOpenRegionRequest(earlierServerName, hri, 0, null);
+1 site. The mvn site goal succeeds with this patch.
-1 core tests. The patch failed these unit tests:
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/9005//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9005//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9005//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9005//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9005//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9005//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9005//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9005//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9005//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9005//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/9005//console
This message is automatically generated.
Actually we may both have said 0.98.2 while meaning 0.98.1. At least I was really referring to 0.98.1
Sorry, I thought you would cut the RC sooner. That's why I thought it would not make it into 0.98.1.
While running TestAssignmentManager and TestAssignmentManagerOnCluster 200 times, I ran into a similar issue in the newly added test (namely, 10 out of 10 tries choosing the dead server out of 2). I've fixed that as well in v5. The runs for those two tests are good now.
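To see why a 200-run loop would plausibly surface this, here is a rough estimate in Java. It assumes each of the 10 tries picks one of the 2 servers uniformly and independently, and that runs are independent of each other; both are simplifications, and the names are hypothetical.

```java
public class RepeatedRunOdds {

    /** Chance of at least one failure across `runs` independent runs. */
    static double atLeastOneFailure(double perRunFailure, int runs) {
        return 1.0 - Math.pow(1.0 - perRunFailure, runs);
    }

    public static void main(String[] args) {
        // Per run: all 10 tries pick the dead server out of 2 (assumed uniform).
        double perRun = Math.pow(0.5, 10);
        System.out.printf("per run:      %.5f%n", perRun);
        System.out.printf("over 200 runs: %.3f%n", atLeastOneFailure(perRun, 200));
    }
}
```

With these assumptions a single run fails about once in 1024 times, but the chance of hitting it at least once in 200 runs is a little under one in five.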
I'll commit this shortly to trunk. Do you want to include this before cut? I can commit to 0.98 if you want.
I'll commit this shortly to trunk. Do you want to include this before cut? I can commit to 0.98 if you want.
Yes, please do, thanks. Tagging tonight most likely.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12634943/hbase-9721_v5.patch
against trunk revision .
ATTACHMENT ID: 12634943
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 21 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/9017//console
This message is automatically generated.
Also attaching the v5-0.98 patch, with import conflicts resolved for the 0.98 branch. No other changes.
I think we can also have this in 0.96.2? saint.ack@gmail.com wdyt?
Attaching v5-0.96.patch, which applies. Ran the TestAssignmentManager* tests 20 times. It seems OK.
FAILURE: Integrated in HBase-TRUNK #5017 (See https://builds.apache.org/job/HBase-TRUNK/5017/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server (enis: rev 1577951)
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/trunk/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/trunk/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
SUCCESS: Integrated in HBase-0.98 #238 (See https://builds.apache.org/job/HBase-0.98/238/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server (enis: rev 1577955)
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/branches/0.98/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/branches/0.98/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
SUCCESS: Integrated in HBase-0.98-on-Hadoop-1.1 #222 (See https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/222/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server (enis: rev 1577955)
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/branches/0.98/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/branches/0.98/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
FAILURE: Integrated in HBase-TRUNK-on-Hadoop-1.1 #120 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-1.1/120/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server (enis: rev 1577951)
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/trunk/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/trunk/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
FAILURE: Integrated in hbase-0.96 #357 (See https://builds.apache.org/job/hbase-0.96/357/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server (stack: rev 1578601)
- /hbase/branches/0.96/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/branches/0.96/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/branches/0.96/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/branches/0.96/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/branches/0.96/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
SUCCESS: Integrated in hbase-0.96-hadoop2 #247 (See https://builds.apache.org/job/hbase-0.96-hadoop2/247/)
HBASE-9721 RegionServer should not accept regionOpen RPC intended for another(previous) server (stack: rev 1578601)
- /hbase/branches/0.96/hbase-client/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
- /hbase/branches/0.96/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
- /hbase/branches/0.96/hbase-client/src/main/java/org/apache/hadoop/hbase/protobuf/RequestConverter.java
- /hbase/branches/0.96/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/AdminProtos.java
- /hbase/branches/0.96/hbase-protocol/src/main/protobuf/Admin.proto
- /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
- /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
- /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerNoMaster.java
- /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
Forgot that we disable assignment timeouts by default now. The RPC is accepted by the new RS, but that RS failed to change the assignment znode, and just gave up assuming timeout.
It seems we can either send the ServerName together with the openRegion call and have the RS reject the RPC by comparing server names, or allow the RS to change the assignment znode on a failed open. I think the former is better.
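A minimal sketch of the first option, in plain Java rather than against the HBase classes. ServerId and RegionServerStub are hypothetical stand-ins, a plain IOException is used in place of DoNotRetryIOException, and the host, port and start codes are the ones from the log line quoted earlier in this issue.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class StartCodeCheckSketch {

    /** Hypothetical server identity: host, port, and start code. */
    static final class ServerId {
        final String host;
        final int port;
        final long startCode;

        ServerId(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }

        @Override
        public String toString() {
            return host + "," + port + "," + startCode;
        }
    }

    /** Hypothetical region server that rejects opens meant for another incarnation. */
    static final class RegionServerStub {
        private final ServerId self;
        final Set<String> openRegions = new HashSet<>();

        RegionServerStub(ServerId self) {
            this.self = self;
        }

        void openRegion(String regionName, ServerId intendedFor) throws IOException {
            // A restarted server reuses host and port; only the start code differs.
            if (intendedFor.startCode != self.startCode) {
                throw new IOException(
                    "This RPC was intended for a different server with startCode: "
                        + intendedFor.startCode + ", this server is: " + self);
            }
            openRegions.add(regionName);
        }
    }

    public static void main(String[] args) {
        ServerId dead = new ServerId("10.11.3.73", 53463, 1394838303235L);
        ServerId restarted = new ServerId("10.11.3.73", 53463, 1394838303335L);
        RegionServerStub rs = new RegionServerStub(restarted);
        try {
            // The master still addresses the previous incarnation.
            rs.openRegion("hbase:meta,,1", dead);
        } catch (IOException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The point of the design is that the sender gets an immediate, explicit failure and can retry elsewhere, instead of the new server accepting the open and then silently failing to update the assignment znode.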