Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0
Component/s: None
Labels: None
Hadoop Flags: Reviewed
Description
In our production cluster, we encountered an issue where the number of files in the /hbase/oldWALs directory was growing rapidly, from a baseline of about 4,000 to over 150,000, and still increasing at roughly 333 files per minute.
On further investigation we found that the ReplicationLogCleaner was being aborted because it could not talk to ZooKeeper. Stack trace below:
2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] zookeeper.ZKUtil - replicationLogCleaner-0x3000002e05e0d8f, quorum=zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181,zookeeper-4:2181, baseZNode=/hbase Unable to get data of znode /hbase/replication/rs
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/rs
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1229)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:374)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)
    at org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:87)
    at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)
    at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:262)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$200(CleanerChore.java:52)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:413)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:410)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.deleteAction(CleanerChore.java:481)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.traverseAndDelete(CleanerChore.java:410)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$100(CleanerChore.java:52)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore$1.run(CleanerChore.java:220)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] master.ReplicationLogCleaner - ReplicationLogCleaner received abort, ignoring. Reason: Failed to get stat of replication rs node
2021-02-25 23:05:01,149 DEBUG [an-pool3-thread-1729] master.ReplicationLogCleaner - org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/rs
2021-02-25 23:05:01,150 WARN [an-pool3-thread-1729] master.ReplicationLogCleaner - Failed to read zookeeper, skipping checking deletable files
2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] master.ReplicationLogCleaner - ReplicationLogCleaner received abort, ignoring. Reason: Failed to get stat of replication rs node
This line is the scarier one: the Abortable was invoked, but the abort was simply ignored and the HMaster went about its business as usual.
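For context, the abort handling that produces this message looks roughly like the following. This is a minimal sketch reconstructed from the log output above, not the exact source; only the class name and the warning text come from the logs.

// Minimal sketch (reconstructed from the logs above) of how the cleaner
// currently handles an abort: the request is logged and then dropped,
// so nothing ever reaches the HMaster.
public class ReplicationLogCleaner extends BaseLogCleanerDelegate implements Abortable {

  @Override
  public void abort(String why, Throwable e) {
    LOG.warn("ReplicationLogCleaner received abort, ignoring. Reason: " + why);
    LOG.debug(e);
    // No call to the master's Abortable here -- the abort is swallowed and
    // the HMaster keeps running with a log cleaner that can no longer work.
  }
}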
Our NameNode enforces a maximum number of items per directory (the HDFS setting dfs.namenode.fs-limits.max-directory-items), which is set to 1M in our clusters. At 333 new files per minute, /hbase/oldWALs would have grown from 150,000 entries to that limit in roughly two days, and reaching it would have brought down the whole cluster.
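For reference, this is the standard HDFS limit; the value below is illustrative of our setup, shown in hdfs-site.xml form:

<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <!-- Maximum number of entries a single HDFS directory may hold (1M here).
       Once a directory reaches this limit, further creates in it fail. -->
  <value>1048576</value>
</property>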
We shouldn't ignore the Abortable: the HMaster should crash when Abortable.abort() is invoked, so it can be restarted rather than silently running without a working log cleaner.
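One possible shape of that fix, as a hypothetical sketch (the field name and wiring are assumptions, not the actual patch): delegate the abort to the HMaster's own Abortable instead of swallowing it.

// Hypothetical sketch of the proposed behavior -- not the actual patch.
// 'abortable' is an assumed field holding the HMaster's Abortable,
// injected when the cleaner is constructed/initialized.
private Abortable abortable;

@Override
public void abort(String why, Throwable e) {
  LOG.error("ReplicationLogCleaner aborted: " + why, e);
  // Propagate to the HMaster so it goes down and can be restarted with a
  // fresh ZooKeeper session, instead of running with a broken cleaner.
  abortable.abort(why, e);
}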