Description
In Class ServerManager, onlineServers is declared as ConcurrentHashMap. In findServerWithSameHostnamePortWithLock(), it has to do a loop to find if there is a ServerName with same host:port pair. If replaced with ConcurrentSkipListMap, findServerWithSameHostnamePortWithLock() can be replaced with a O(logN) implementation.
I run some performance comparison(test the function only), it seems that there is no difference if there are 1000 servers. With more servers, ConcurrentSkipListMap implementation is going to win big.
Attachments
Attachments
- HBASE-16275-v001.patch
- 2 kB
- Hua Xiang
- HBASE-16275-v002.patch
- 1 kB
- Hua Xiang
- TestServerManagerPerf.java
- 7 kB
- Hua Xiang
Issue Links
- depends upon
-
HBASE-16272 Overflow in ServerName's compareTo method
- Resolved
Activity
Checking TestProcedureManager in local, we will see below exception:
2016-07-23 12:17:05,583 FATAL [M:0;localhost:57006] master.HMaster(2154): Unhandled: inconsistent range java.lang.IllegalArgumentException: inconsistent range at java.util.concurrent.ConcurrentSkipListMap$SubMap.<init>(ConcurrentSkipListMap.java:2506) at java.util.concurrent.ConcurrentSkipListMap.subMap(ConcurrentSkipListMap.java:1984) at org.apache.hadoop.hbase.master.ServerManager.findServerWithSameHostnamePortWithLock(ServerManager.java:448) at org.apache.hadoop.hbase.master.ServerManager.checkAndRecordNewServer(ServerManager.java:357) at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:267) at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:269) at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2362) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:946)
Checking the code, the problem is that ServerName#compareTo will do a force cast when comparing startcode:
return (int)(this.getStartcode() - other.getStartcode());
When we cast the result of 0 - Long.MAX_VALUE to int, it will return 1, making fromKey larger than toKey when calling CSLM#subMap. Simply test
long x = 0L - Long.MAX_VALUE; System.out.println(x); System.out.println((int) x);
and we could see below result:
-9223372036854775807 1
Please resolve the above problem (well, to be fair, this actually is a bug of ServerName#compareTo, but still...) and update the patch huaxiang, thanks.
the compareTo() fix is part of HBASE-16272. it was committed yesterday. so this run did not contain it
Thanks carp84 and mbertozzi. As Matteo mentioned, the compareTo() fix was committed just recently. This diff is the initial patch, work in progress.
Thanks for the reference of HBASE-16272 mbertozzi, will wait for the rebased patch then huaxiang.
As you said, ‘’it seems that there is no difference if there are 1000 servers“”, so how many servers will the patch make some difference. As far as I know, the biggest HBase cluster is still no more than several thousands nodes. So I think this fix is trival
allan163, Yeah, I do not numbers yet, still working on some real performance testing, will share more later.
Some numbers to share, tested with number of unique servers. The test program is attached.
Server count 100 200 500 1000 2000 3000 5000 WOPatch 8(52) 13(228) 22(498) 47(1176) 150(3518) 309(7970) 855(20756) WithPatch 8(102) 11(202) 16(341) 23(548) 34(823) 47(1114) 65(2192)
the format is like N(M), N is the max time spent by one thread in Milliseconds, M is the total time spent by all threads. It gives rough idea.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
+1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
-1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
+1 | mvninstall | 4m 50s | master passed |
+1 | compile | 1m 1s | master passed with JDK v1.8.0 |
+1 | compile | 0m 49s | master passed with JDK v1.7.0_80 |
+1 | checkstyle | 1m 8s | master passed |
+1 | mvneclipse | 0m 29s | master passed |
+1 | findbugs | 2m 48s | master passed |
+1 | javadoc | 0m 36s | master passed with JDK v1.8.0 |
+1 | javadoc | 0m 44s | master passed with JDK v1.7.0_80 |
+1 | mvninstall | 1m 0s | the patch passed |
+1 | compile | 0m 59s | the patch passed with JDK v1.8.0 |
+1 | javac | 0m 59s | the patch passed |
+1 | compile | 0m 51s | the patch passed with JDK v1.7.0_80 |
+1 | javac | 0m 51s | the patch passed |
+1 | checkstyle | 1m 5s | the patch passed |
+1 | mvneclipse | 0m 28s | the patch passed |
+1 | whitespace | 0m 0s | Patch has no whitespace issues. |
+1 | hadoopcheck | 43m 19s | Patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. |
+1 | findbugs | 3m 0s | the patch passed |
+1 | javadoc | 0m 40s | the patch passed with JDK v1.8.0 |
+1 | javadoc | 0m 49s | the patch passed with JDK v1.7.0_80 |
-1 | unit | 112m 57s | hbase-server in the patch failed. |
+1 | asflicense | 0m 21s | Patch does not generate ASF License warnings. |
178m 29s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures |
Timed out junit tests | org.apache.hadoop.hbase.regionserver.TestRegionMergeTransactionOnCluster |
org.apache.hadoop.hbase.TestAcidGuarantees | |
org.apache.hadoop.hbase.regionserver.TestHRegionOnCluster | |
org.apache.hadoop.hbase.mapreduce.TestSecureLoadIncrementalHFiles | |
org.apache.hadoop.hbase.mapreduce.TestTableSnapshotInputFormat |
Subsystem | Report/Notes |
---|---|
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12820257/HBASE-16275-v002.patch |
JIRA Issue | |
Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
uname | Linux penates.apache.org 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
git revision | master / bcf409e |
Default Java | 1.7.0_80 |
Multi-JDK versions | /usr/local/jenkins/java/jdk1.8.0:1.8.0 /home/jenkins/jenkins-slave/tools/hudson.model.JDK/JDK_1.7_latest_:1.7.0_80 |
findbugs | v3.0.0 |
unit | https://builds.apache.org/job/PreCommit-HBASE-Build/2780/artifact/patchprocess/patch-unit-hbase-server.txt |
unit test logs | https://builds.apache.org/job/PreCommit-HBASE-Build/2780/artifact/patchprocess/patch-unit-hbase-server.txt |
Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/2780/testReport/ |
modules | C: hbase-server U: hbase-server |
Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/2780/console |
Powered by | Apache Yetus 0.2.1 http://yetus.apache.org |
This message was automatically generated.
Thanks ted_yu for review, let me check the failed test case and report back.
I run TestSnapshotFromMaster, without my patch, it also failed for me, created
HBASE-16293 and taking a look now.
+1 on v2, checked all failed UT on local with patch and confirmed all passed, also checked TestServerManagerPerf with 2000 nodes on local (mac pro) and confirmed there's performance gain (not that much, but probably due to my poor machine...).
FAILURE: Integrated in HBase-1.4 #310 (See https://builds.apache.org/job/HBase-1.4/310/)
HBASE-16275 Change ServerManager#onlineServers from ConcurrentHashMap to (matteo.bertozzi: rev 7983d2f45b073d6e36f0623e142c83e7500f0774)
- hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
FAILURE: Integrated in HBase-Trunk_matrix #1313 (See https://builds.apache.org/job/HBase-Trunk_matrix/1313/)
HBASE-16275 Change ServerManager#onlineServers from ConcurrentHashMap to (matteo.bertozzi: rev 66fb697f816cc94030079ce9f3c60aea5b3a6a0c)
- hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #1251 (See https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/1251/)
HBASE-16275 Change ServerManager#onlineServers from ConcurrentHashMap to (apurtell: rev 903e7d5f7a71055636b9ff8a8b48038659de7572)
- hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
FAILURE: Integrated in HBase-0.98-matrix #379 (See https://builds.apache.org/job/HBase-0.98-matrix/379/)
HBASE-16275 Change ServerManager#onlineServers from ConcurrentHashMap to (apurtell: rev 903e7d5f7a71055636b9ff8a8b48038659de7572)
- hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
HBASE-16275This message was automatically generated.