[HBASE-24779] Improve insight into replication WAL readers hung on checkQuota - ASF JIRA

Details

Type: Task
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0-alpha-1, 2.4.0
Component/s: Replication
Labels:
None

Hadoop Flags:

Reviewed
Release Note:

New replication metrics are exposed on the global source. They report usage of the "WAL entry buffer" that was introduced in HBASE-15995. When this usage reaches the limit, that RegionServer stops reading more WAL data for replication. This usage (and limit) is local to each RegionServer and is shared across all peers being handled by that RegionServer.

Description

Helped a customer this past weekend who has a large number of RegionServers. Some of those RegionServers replicated data to a peer without issues, while others did not.

The number of queued logs varied over the past 24 hours in the same manner. At times it spiked into the hundreds of logs; at other times only a handful to tens of logs were queued.

We were able to validate that there were "good" and "bad" RegionServers by creating a test table, assigning it to a RegionServer, enabling replication on that table, and checking whether the local puts were replicated to a peer. On a good RS, data was replicated immediately. On a bad RS, data was never replicated (at least, not within the tens of minutes we waited).

On the "bad RS", we were able to observe that the wal-reader thread(s) were spending their time in a Thread.sleep() at a different location than on the "good RS". Specifically, they were sitting in the sleep call in ReplicationSourceWALReader#checkQuota(), not in the handleEmptyWALBatch() method of the same class.
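The check described above amounts to asking which threads currently have a given method on their stack. In practice this is usually done by reading a thread dump (for example from jstack) by eye. The following is only an in-JVM sketch of the same idea: `SleepSiteFinder`, its local `checkQuota()` stand-in, and the thread name are hypothetical and are not HBase code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SleepSiteFinder {
    /** Returns the names of live threads whose stack contains a frame for methodName. */
    static List<String> threadsInMethod(String methodName) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : e.getValue()) {
                if (frame.getMethodName().equals(methodName)) {
                    names.add(e.getKey().getName());
                    break; // one match per thread is enough
                }
            }
        }
        return names;
    }

    /** Hypothetical stand-in for a reader waiting on quota; NOT the HBase method. */
    static void checkQuota() throws InterruptedException {
        Thread.sleep(60_000);
    }

    public static void main(String[] args) throws Exception {
        Thread reader = new Thread(() -> {
            try {
                checkQuota();
            } catch (InterruptedException ignored) {
                // interrupted by main below; nothing to clean up
            }
        }, "wal-reader-demo");
        reader.setDaemon(true);
        reader.start();
        Thread.sleep(500); // give the demo thread time to reach its sleep
        System.out.println(threadsInMethod("checkQuota"));
        reader.interrupt();
    }
}
```

A thread that shows up for `checkQuota` but not for `handleEmptyWALBatch` would match the "bad RS" pattern described above.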

My only assumption is that, somehow, these RegionServers got into a situation where they "allocated" memory from the quota but never freed it. Then, because the WAL reader thinks it has no free memory, it blocks indefinitely, and there are no pending edits to ship and (thus) free that memory. A cursory glance at the code gives me a lot of anxiety around places where we don't properly release that quota (e.g. batches that fail to ship, dropping a peer). As a first stab, let me add some more debugging so we can actually track this state properly for the operators and their sanity.
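To make the suspected failure mode concrete, here is a toy model of a RegionServer-wide buffer shared by all peers. It is a sketch under assumptions, not the HBase implementation: the class `BufferQuotaSketch`, its method names, and the byte values are all hypothetical. It only illustrates how usage that is acquired but never released leaves a reader permanently without quota.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of a shared WAL entry buffer quota; names are hypothetical. */
class BufferQuotaSketch {
    private final long limitBytes;
    private final AtomicLong usedBytes = new AtomicLong();

    BufferQuotaSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** A reader may only read more entries while usage is below the limit. */
    boolean hasQuota() {
        return usedBytes.get() < limitBytes;
    }

    /** Called when a reader buffers a batch of WAL entries. */
    void acquire(long batchBytes) {
        usedBytes.addAndGet(batchBytes);
    }

    /** Called when a batch has been shipped (or otherwise discarded). */
    void release(long batchBytes) {
        usedBytes.addAndGet(-batchBytes);
    }

    long used() {
        return usedBytes.get();
    }

    public static void main(String[] args) {
        BufferQuotaSketch quota = new BufferQuotaSketch(100);

        quota.acquire(60);
        quota.release(60); // normal path: batch shipped, usage returned

        quota.acquire(60); // assumed leak: batch fails to ship, never released
        quota.acquire(60); // assumed leak: peer dropped, never released

        // prints "used=120 hasQuota=false"
        System.out.println("used=" + quota.used() + " hasQuota=" + quota.hasQuota());
    }
}
```

In this model nothing ever brings `used` back down once the leaked batches are gone, so a reader that sleeps until `hasQuota()` is true would wait forever. Exposing the usage as a metric makes that state visible to operators.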

Issue Links

is related to

HBASE-24834 TestReplicationSource.testWALEntryFilter failing in branch-2+

Resolved

relates to

HBASE-20417 Do not read wal entries when peer is disabled

Resolved

HBASE-15995 Separate replication WAL reading from shipping

Closed

HBASE-24813 ReplicationSource should clear buffer usage on ReplicationSourceManager upon termination

Resolved

HBASE-25003 Backport HBASE-24350 and HBASE-24779 to branch-2.2 & branch-2.3

Resolved

links to

GitHub Pull Request #2193

People

Assignee: Josh Elser

Reporter: Josh Elser

Votes: 0

Watchers: 8

Dates

Created: 27/Jul/20 16:00

Updated: 12/Nov/20 21:53

Resolved: 07/Aug/20 22:51