[HBASE-16721] Concurrency issue in WAL unflushed seqId tracking - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.0.0, 1.1.0, 1.2.0
Fix Version/s: 1.3.0, 1.2.4, 1.1.8, 2.0.0
Component/s: wal
Labels: None
Hadoop Flags: Reviewed
Release Note:
Fixed a bug in sequenceId tracking for the WALs that caused WAL files to accumulate without being deleted due to a rare race condition.

Description

I'm inspecting an interesting case where, in a production cluster, some regionservers end up accumulating hundreds of WAL files even though flushes are being forced because the max logs limit is exceeded. This happened multiple times on this cluster, but not on other clusters. The cluster has the periodic memstore flusher disabled; however, that still does not explain why the forced flush of regions due to the max logs limit is not working. I think the periodic memstore flusher just masks the underlying problem, which is why we do not see this in other clusters.

The problem starts like this:

2016-09-21 17:49:18,272 INFO  [regionserver//10.2.0.55:16020.logRoller] wal.FSHLog: Too many wals: logs=33, maxlogs=32; forcing flush of 1 regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
2016-09-21 17:49:18,273 WARN  [regionserver//10.2.0.55:16020.logRoller] regionserver.LogRoller: Failed to schedule flush of d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null

It then continues until the RS is restarted:

2016-09-23 17:43:49,356 INFO  [regionserver//10.2.0.55:16020.logRoller] wal.FSHLog: Too many wals: logs=721, maxlogs=32; forcing flush of 1 regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
2016-09-23 17:43:49,357 WARN  [regionserver//10.2.0.55:16020.logRoller] regionserver.LogRoller: Failed to schedule flush of d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null

The problem is that region d4cf39dc40ea79f5da4d0cf66d03cb1f was already split some time ago, and it was able to flush its data and split without any problems. However, the FSHLog still thinks that there is some unflushed data for this region.
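
To make the symptom concrete, here is a minimal sketch of why a single stale "unflushed" entry can pin WAL files forever. It is a simplified, hypothetical model and not HBase's FSHLog code: the class WalTrackerSketch and all of its methods are invented for illustration, and the stale entry is injected directly because the description above does not spell out the exact interleaving that produces it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of WAL archiving. Not the real FSHLog. */
class WalTrackerSketch {
    // region name -> lowest sequence id appended but not yet flushed
    private final Map<String, Long> lowestUnflushedSeqId = new HashMap<>();
    // one map per closed WAL file: region name -> highest sequence id in that file
    private final List<Map<String, Long>> closedWals = new ArrayList<>();
    private Map<String, Long> currentWal = new HashMap<>();

    void append(String region, long seqId) {
        lowestUnflushedSeqId.putIfAbsent(region, seqId);
        currentWal.merge(region, seqId, Long::max);
    }

    /** A completed flush clears the region's unflushed marker. */
    void flushCompleted(String region) {
        lowestUnflushedSeqId.remove(region);
    }

    /** Assumption: stands in for whatever race leaves a marker behind after the flush. */
    void injectStaleEntry(String region, long seqId) {
        lowestUnflushedSeqId.put(region, seqId);
    }

    void rollWriter() {
        closedWals.add(currentWal);
        currentWal = new HashMap<>();
    }

    /** Drops every closed WAL with no unflushed edits; returns how many remain. */
    int archiveFlushedWals() {
        closedWals.removeIf(this::isFullyFlushed);
        return closedWals.size();
    }

    private boolean isFullyFlushed(Map<String, Long> wal) {
        for (Map.Entry<String, Long> e : wal.entrySet()) {
            Long lowest = lowestUnflushedSeqId.get(e.getKey());
            // The file is still needed if the region has an unflushed edit at or below
            // the highest sequence id this file holds for that region.
            if (lowest != null && lowest <= e.getValue()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        WalTrackerSketch healthy = new WalTrackerSketch();
        healthy.append("d4cf39dc", 1L);
        healthy.rollWriter();
        healthy.flushCompleted("d4cf39dc");
        System.out.println("healthy, WALs retained: " + healthy.archiveFlushedWals());

        WalTrackerSketch stale = new WalTrackerSketch();
        stale.append("d4cf39dc", 1L);
        stale.rollWriter();
        stale.flushCompleted("d4cf39dc");
        stale.injectStaleEntry("d4cf39dc", 1L); // marker survives the flush
        System.out.println("stale, WALs retained: " + stale.archiveFlushedWals());
    }
}
```

In this model the stale marker can only be cleared by another flush of the same region. If the region has already been split and is no longer online, no such flush can be scheduled, which would be consistent with the repeated "Failed to schedule flush" warnings and the growing logs count shown above.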

Attachments

hbase-16721_v2.master.patch | 29/Sep/16 02:11 | 6 kB | Enis Soztutar
hbase-16721_v2.branch-1.patch | 29/Sep/16 01:49 | 8 kB | Enis Soztutar
hbase-16721_v1.branch-1.patch | 28/Sep/16 21:56 | 2 kB | Enis Soztutar
hbase-16721_addendum2.branch-1.1.patch | 13/Oct/16 00:20 | 2 kB | Enis Soztutar
hbase-16721_addendum2.branch-1.1.patch | 13/Oct/16 18:42 | 2 kB | Enis Soztutar
hbase-16721_addendum2.branch-1.1.patch | 14/Oct/16 20:53 | 2 kB | Enis Soztutar
hbase-16721_addendum.patch | 01/Oct/16 21:58 | 0.5 kB | Enis Soztutar

Issue Links

is related to

HBASE-23181 Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us" (Resolved)

relates to

HBASE-16820 BulkLoad mvcc visibility only works accidentally (Resolved)

HBASE-23157 WAL unflushed seqId tracking may wrong when Durability.ASYNC_WAL is used (Resolved)

People

Assignee: Enis Soztutar

Reporter: Enis Soztutar

Votes: 0

Watchers: 17

Dates

Created: 28/Sep/16 00:33

Updated: 19/Oct/19 15:18

Resolved: 17/Oct/16 21:32