[HBASE-23181] Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us" - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.2.1
Fix Version/s: 3.0.0-alpha-1, 2.3.0, 2.1.8, 2.2.3
Component/s: regionserver, wal
Labels: None
Hadoop Flags: Reviewed

Description

On a heavily loaded cluster, the WAL count keeps rising and we can get into a state where we are not rolling the logs off fast enough. In particular, there is an interesting state at the extreme where we pick a region to flush because of 'Too many WALs', but the region is actually not online. As the WAL count rises, we keep picking a region to flush that is no longer on the server. This condition blocks us from clearing WALs; eventually the WAL count climbs into the hundreds and the RS goes zombie with a full call queue that starts throwing CallQueueTooBigExceptions (bad if this server is the one carrying hbase:meta), i.e. clients fail to access the RegionServer.

One symptom is a fast spike in WAL count for the RS. A restart of the RS will break the bind.
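
The stuck state described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the HBase implementation: each WAL file is modeled as the set of regions that still have unflushed edits in it, only the oldest file is consulted when picking regions to flush, and a flush request for a region that is not online is silently dropped. The function name `simulate_wal_count` and all of its parameters are hypothetical.

```python
def simulate_wal_count(num_rolls, max_logs, online_regions, oldest_file_region):
    """Toy model: return the WAL file count observed after each log roll."""
    online = set(online_regions)
    # Each WAL file is modeled as the set of regions with unflushed edits in it.
    # The oldest file starts out pinned by `oldest_file_region`.
    wal_files = [{oldest_file_region}]
    counts = []
    for _ in range(num_rolls):
        # Roll: start a new file that receives edits from every online region.
        wal_files.append(set(online))
        if len(wal_files) > max_logs:
            # "Too many WALs": ask for a flush of the regions pinning the oldest file.
            for region in sorted(wal_files[0]):
                if region in online:
                    # Flush succeeds, so the region no longer pins any file.
                    for wal_file in wal_files:
                        wal_file.discard(region)
                # Otherwise the region is "not online on us" and nothing happens,
                # so the oldest file stays pinned.
        # Archive leading files that no region pins (always keep the current file).
        while len(wal_files) > 1 and not wal_files[0]:
            wal_files.pop(0)
        counts.append(len(wal_files))
    return counts


# Oldest file pinned by a region that has been closed: the count grows on every roll.
stuck = simulate_wal_count(10, 3, {"a", "b"}, "closed-region")
print(stuck)  # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Oldest file pinned by an online region: forced flushes keep the count bounded.
healthy = simulate_wal_count(10, 3, {"a", "b"}, "a")
print(max(healthy) <= 3)
```

In the model, the only difference between the two runs is whether the region pinning the oldest file is online, which is enough to turn a bounded WAL count into one that grows with every roll.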

Here is how it looks in the log:

# Here is region closing....
2019-10-16 23:10:55,897 INFO org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler: Closed 8ee433ad59526778c53cc85ed3762d0b

....

# Then soon after ...
2019-10-16 23:11:44,041 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is not online on us
2019-10-16 23:11:45,006 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; count=45, max=32; forcing flush of 1 regions(s): 8ee433ad59526778c53cc85ed3762d0b

...
# Later...

2019-10-16 23:20:25,427 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; count=542, max=32; forcing flush of 1 regions(s): 8ee433ad59526778c53cc85ed3762d0b
2019-10-16 23:20:25,427 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is not online on us
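
To check whether a RegionServer log shows this pattern, a small scanner along the following lines may help. It is a sketch only: the regular expressions are written against the two message formats quoted above, the function name `find_stuck_regions` and the `min_failures` threshold are hypothetical, and splitting multiple region names on commas or whitespace is an assumption, since the excerpt only shows single-region messages.

```python
import re
from collections import Counter

# Patterns taken from the two log messages quoted in this issue.
TOO_MANY_WALS = re.compile(
    r"Too many WALs; count=(\d+), max=(\d+); forcing flush of \d+ regions\(s\): (.+)$"
)
NOT_ONLINE = re.compile(
    r"Failed to schedule flush of (\w+), because it is not online on us"
)


def find_stuck_regions(lines, min_failures=2):
    """Return {region: (failed_flush_attempts, peak_wal_count)} for suspect regions."""
    failures = Counter()
    peak_count = {}
    for line in lines:
        line = line.rstrip("\n")
        match = NOT_ONLINE.search(line)
        if match:
            failures[match.group(1)] += 1
            continue
        match = TOO_MANY_WALS.search(line)
        if match:
            count = int(match.group(1))
            # Assumption: multiple regions are separated by commas and/or whitespace.
            for region in re.split(r"[,\s]+", match.group(3).strip()):
                if region:
                    peak_count[region] = max(peak_count.get(region, 0), count)
    return {
        region: (attempts, peak_count.get(region))
        for region, attempts in failures.items()
        if attempts >= min_failures
    }


sample = [
    "2019-10-16 23:11:44,041 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is not online on us",
    "2019-10-16 23:11:45,006 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; count=45, max=32; forcing flush of 1 regions(s): 8ee433ad59526778c53cc85ed3762d0b",
    "2019-10-16 23:20:25,427 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; count=542, max=32; forcing flush of 1 regions(s): 8ee433ad59526778c53cc85ed3762d0b",
    "2019-10-16 23:20:25,427 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is not online on us",
]
print(find_stuck_regions(sample))
# {'8ee433ad59526778c53cc85ed3762d0b': (2, 542)}
```

A region that keeps showing up in both messages while the reported count keeps climbing is a candidate for the state described in this issue.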

I've seen these runaway WALs in 2.2.1. I've also regularly seen runaway WALs in a 1.2.x version that had the HBASE-16721 fix in it, but can't say yet whether it was for the same reason as above.

Issue Links

relates to

HBASE-23157 WAL unflushed seqId tracking may wrong when Durability.ASYNC_WAL is used

Resolved

HBASE-16721 Concurrency issue in WAL unflushed seqId tracking

Closed

HBASE-23221 Polish the WAL interface after HBASE-23181

Resolved

links to

GitHub Pull Request #739

GitHub Pull Request #742

GitHub Pull Request #753

People

Assignee: Duo Zhang

Reporter: Michael Stack

Votes: 0

Watchers: 11

Dates

Created: 17/Oct/19 03:55

Updated: 01/Nov/19 14:30

Resolved: 26/Oct/19 15:02