Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.17, 2.5.5
Description
When running HBase 2.4.17, I hit an NPE in the regionserver log.
Reproduce
Set up the HBase cluster: 1 HMaster, 2 RegionServers, Hadoop 2.10.2.
Execute the following commands on the HMaster node using the hbase shell:
create 'uuidd9efa97f93a442b686adae6d9f7bb2e9', {NAME => 'uuid099cbece77834a83a52bb0611c3ea080', VERSIONS => 3, COMPRESSION => 'NONE', BLOOMFILTER => 'ROWCOL', IN_MEMORY => 'false'}, {NAME => 'uuidbc1bea73952749329d7f025aab382c4e', VERSIONS => 2, COMPRESSION => 'GZ', BLOOMFILTER => 'ROW', IN_MEMORY => 'false'}, {NAME => 'uuidff292310d9dc450697af2bb25d9f3e98', VERSIONS => 2, COMPRESSION => 'GZ', BLOOMFILTER => 'NONE', IN_MEMORY => 'false'}, {NAME => 'uuid449de028da6b4d35be0f187ebec6c3be', VERSIONS => 2, COMPRESSION => 'GZ', BLOOMFILTER => 'ROW', IN_MEMORY => 'false'}, {NAME => 'uuidc0840c98f9d348a18f2d454c7a503b65', VERSIONS => 2, COMPRESSION => 'GZ', BLOOMFILTER => 'ROWCOL', IN_MEMORY => 'false'}
create_namespace 'uuidec797633f5dd4ab9b96276135aeda9e2'
create 'uuiddeb610fded9744889840ecd03dd18739', {NAME => 'uuid30a0f625ad454605908b60c932957ff0', VERSIONS => 1, COMPRESSION => 'GZ', BLOOMFILTER => 'ROW', IN_MEMORY => 'true'}
incr 'uuidd9efa97f93a442b686adae6d9f7bb2e9', 'uuid46ddc3d3557e413e915e2393ae72c082', 'uuidbc1bea73952749329d7f025aab382c4e:JZycbUSpbDQmwgXinp', 1
flush 'uuidd9efa97f93a442b686adae6d9f7bb2e9', 'uuid449de028da6b4d35be0f187ebec6c3be'
drop 'uuiddeb610fded9744889840ecd03dd18739'
put 'uuidd9efa97f93a442b686adae6d9f7bb2e9', 'uuidf4704cae4d1e4661bd7664d26eb6b31b', 'uuidbc1bea73952749329d7f025aab382c4e:JZycbUSpbDQmwgXinp', 'XlPpFGvSYfcEXWXgwARytlSeiaSuHJFqpirMmLduqGnpdXLlHJWBumraXiifQSvHqNHmTcyzLQIvuQrkujPghfdtRkhOkgKEJHsAuAiMMeWZjdTHNZqhkOdJBOzsRYUXKOCNKeSxEDWgnKgsFDHMtxdnKKudBuceOgYmCrdaPXMclKkZKCIEiFDcdoAEJGKXYVfOjb'
disable 'uuidd9efa97f93a442b686adae6d9f7bb2e9'
drop 'uuidd9efa97f93a442b686adae6d9f7bb2e9'
create 'uuid9d05a5cb34e64910ac90675186e7d0d4', {NAME => 'uuid1ce512a5997b4efea3bdead2e7f723c3', VERSIONS => 2, COMPRESSION => 'NONE', BLOOMFILTER => 'ROWCOL', IN_MEMORY => 'true'}, {NAME => 'uuid0b1baaa4275e46b2a3a1d11d6540fc30', VERSIONS => 2, COMPRESSION => 'NONE', BLOOMFILTER => 'NONE', IN_MEMORY => 'true'}
put 'uuid9d05a5cb34e64910ac90675186e7d0d4', 'uuid552e42ade4c14099a1d8643bea1616d4', 'uuid1ce512a5997b4efea3bdead2e7f723c3:l', 1
drop 'uuid9d05a5cb34e64910ac90675186e7d0d4'
The exception is then thrown on either RS1 or RS2:
2023-09-19 20:29:28,268 INFO  [RS_OPEN_REGION-regionserver/hregion2:16020-2] handler.AssignRegionHandler: Opened uuid9d05a5cb34e64910ac90675186e7d0d4,,1695155367072.f59a0693a9469f9e1f131bf2aac1486d.
2023-09-19 20:29:29,205 ERROR [regionserver/hregion2:16020.Chore.1] hbase.ScheduledChore: Caught error
java.lang.NullPointerException
  at org.apache.hadoop.hbase.quotas.QuotaCache$QuotaRefresherChore.updateQuotaFactors(QuotaCache.java:378)
  at org.apache.hadoop.hbase.quotas.QuotaCache$QuotaRefresherChore.chore(QuotaCache.java:224)
  at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:158)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
  at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:107)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:750)
Root Cause
The NPE is thrown in updateQuotaFactors():
private void updateQuotaFactors() {
  // Update machine quota factor
  ClusterMetrics clusterMetrics;
  try {
    clusterMetrics = rsServices.getConnection().getAdmin()
      .getClusterMetrics(EnumSet.of(Option.SERVERS_NAME, Option.TABLE_TO_REGIONS_COUNT));
  } catch (IOException e) {
    LOG.warn("Failed to get cluster metrics needed for updating quotas", e);
    return;
  }

  int rsSize = clusterMetrics.getServersName().size();
  if (rsSize != 0) {
    // TODO if use rs group, the cluster limit should be shared by the rs group
    machineQuotaFactor = 1.0 / rsSize;
  }

  Map<TableName, RegionStatesCount> tableRegionStatesCount =
    clusterMetrics.getTableRegionStatesCount();

  // Update table machine quota factors
  for (TableName tableName : tableQuotaCache.keySet()) {
    double factor = 1;
    try {
      // BUGGY LINE
      long regionSize = tableRegionStatesCount.get(tableName).getOpenRegions();
      if (regionSize == 0) {
        factor = 0;
      } else {
        int localRegionSize = rsServices.getRegions(tableName).size();
        factor = 1.0 * localRegionSize / regionSize;
      }
    } catch (IOException e) {
      LOG.warn("Get table regions failed: {}", tableName, e);
    }
    tableMachineQuotaFactors.put(tableName, factor);
  }
}
At line 378, tableRegionStatesCount.get(tableName) returns null, and dereferencing that result throws the NPE.
The tableName that triggers the NPE is 'uuidd9efa97f93a442b686adae6d9f7bb2e9', which was disabled and dropped by the user.
long regionSize = tableRegionStatesCount.get(tableName).getOpenRegions();
The root cause is that updateQuotaFactors() iterates over tableQuotaCache.keySet(), which may still contain a table that has already been dropped (the cache is outdated). Because the dropped table no longer appears in the tableRegionStatesCount map returned by the cluster metrics, the lookup returns null.
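For illustration only, here is a minimal, self-contained Java sketch of the same pattern; the class and variable names are stand-ins, not HBase code. A cache still holds the name of a dropped table, the metrics map no longer has an entry for it, and dereferencing the null lookup result throws the NPE.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StaleQuotaCacheDemo {
  // Stand-in for RegionStatesCount: only the method used by the buggy line.
  static class RegionCounts {
    private final int openRegions;
    RegionCounts(int openRegions) { this.openRegions = openRegions; }
    int getOpenRegions() { return openRegions; }
  }

  public static void main(String[] args) {
    // The local quota cache still remembers a table that was dropped.
    Set<String> tableQuotaCache = new HashSet<>();
    tableQuotaCache.add("uuidd9efa97f93a442b686adae6d9f7bb2e9");

    // The cluster metrics map only contains tables that still exist,
    // so it has no entry for the dropped table.
    Map<String, RegionCounts> tableRegionStatesCount = new HashMap<>();

    for (String table : tableQuotaCache) {
      // get() returns null for the dropped table; calling getOpenRegions()
      // on that null reference throws the same NPE as QuotaCache.java:378.
      long openRegions = tableRegionStatesCount.get(table).getOpenRegions();
      System.out.println(table + " -> " + openRegions);
    }
  }
}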
Fix
One possible interim fix is to check that the table still exists (i.e. that the lookup result is non-null) before using it. I have attached the full regionserver log and a simple patch.
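For illustration, a minimal sketch of such a guard inside the loop of updateQuotaFactors(), using the names from the code quoted under Root Cause. This is a fragment, not the attached patch; skipping the dropped table is one possible behavior, another option would be to remove its quota entry entirely.

for (TableName tableName : tableQuotaCache.keySet()) {
  double factor = 1;
  try {
    RegionStatesCount regionStatesCount = tableRegionStatesCount.get(tableName);
    if (regionStatesCount == null) {
      // The table is no longer reported by the cluster metrics (e.g. it was
      // dropped), so skip it instead of dereferencing a null entry.
      continue;
    }
    long regionSize = regionStatesCount.getOpenRegions();
    if (regionSize == 0) {
      factor = 0;
    } else {
      int localRegionSize = rsServices.getRegions(tableName).size();
      factor = 1.0 * localRegionSize / regionSize;
    }
  } catch (IOException e) {
    LOG.warn("Get table regions failed: {}", tableName, e);
  }
  tableMachineQuotaFactors.put(tableName, factor);
}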
This bug should also occur in 2.5.5, since the related code in QuotaCache is unchanged, but I have only tested 2.4.17.
Attachments
Issue Links