HBase / HBASE-22289

WAL-based log splitting resubmit threshold may result in a task being stuck forever


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0, 1.5.0
    • Fix Version/s: 2.0.6, 2.1.5
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Not sure if this is handled better in procedure-based WAL splitting; in any case, it affects the versions before that.
      The problem is not in ZK as such but in the master's internal state tracking, it seems.

      Master:

      2019-04-21 01:49:49,584 INFO  [master/<master>:17000.splitLogManager..Chore.1] coordination.SplitLogManagerCoordination: Resubmitting task <path>.1555831286638
      

      Worker RS, where the split fails:

      ....
      2019-04-21 02:05:31,774 INFO  [RS_LOG_REPLAY_OPS-regionserver/<worker-rs>:17020-1] wal.WALSplitter: Processed 24 edits across 2 regions; edits skipped=457; log file=<path>.1555831286638, length=2156363702, corrupted=false, progress failed=true
      

      Master (not sure about the delay of the "acquired" message; at any rate, it seems to detect the failure from this server fine):

      2019-04-21 02:11:14,928 INFO  [main-EventThread] coordination.SplitLogManagerCoordination: Task <path>.1555831286638 acquired by <worker-rs>,17020,1555539815097
      2019-04-21 02:19:41,264 INFO  [master/<master>:17000.splitLogManager..Chore.1] coordination.SplitLogManagerCoordination: Skipping resubmissions of task <path>.1555831286638 because threshold 3 reached
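
      For reference, here is a minimal sketch of the kind of check that would produce the "Skipping resubmissions" message above; the class, field, and method names are illustrative, not the exact SplitLogManagerCoordination code:

      import java.util.concurrent.atomic.AtomicInteger;

      // Illustrative sketch only; not the actual SplitLogManagerCoordination source.
      class ResubmitSketch {
        static final int RESUBMIT_THRESHOLD = 3; // matches "threshold 3" in the log above

        static class Task {
          final AtomicInteger unforcedResubmits = new AtomicInteger();
          volatile String curWorkerName;    // worker the master believes holds the task
          volatile String status = "in_progress";
        }

        /** Returns true if the task was handed back for another worker to pick up. */
        boolean resubmit(String path, Task task, boolean force) {
          if (!force && task.unforcedResubmits.get() >= RESUBMIT_THRESHOLD) {
            // Threshold reached: the chore gives up on unforced resubmits. Note that
            // curWorkerName is NOT cleared here, so the master keeps pointing the task
            // at the old, failed worker -- the stuck state described in this issue.
            System.out.println("Skipping resubmissions of task " + path
                + " because threshold " + RESUBMIT_THRESHOLD + " reached");
            return false;
          }
          if (!force) {
            task.unforcedResubmits.incrementAndGet();
          }
          task.curWorkerName = null;   // forget the failed worker
          task.status = "unassigned";  // re-publish as UNASSIGNED so another RS can grab it
          return true;
        }
      }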
      

      After that, the task is stuck in limbo forever with the old worker and is never resubmitted.
      The RS never logs anything else for this task.
      Killing the RS on that worker unblocked the task, and some other server did the split very quickly; so it seems the master doesn't clear the worker name in its internal state when hitting the threshold (see the sketch after the state dump below). The master was never restarted, so restarting it might also have cleared the stuck state.
      The following is extracted from SplitLogManager log messages; note the timestamps.

      2019-04-21 02:2   1555831286638=last_update = 1555837874928 last_version = 11 cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20, 
      ....
      2019-04-22 11:1   1555831286638=last_update = 1555837874928 last_version = 11 cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20}
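
      That killing the RS unblocked the task suggests the dead-worker path does a forced resubmit that ignores the threshold, while the threshold path above leaves the worker name in place. A hedged sketch of that asymmetry, again with illustrative names rather than the actual SplitLogManager code:

      import java.util.Map;

      // Illustrative sketch building on ResubmitSketch above; not actual HBase code.
      class DeadWorkerSketch extends ResubmitSketch {
        /** Invoked when a regionserver is declared dead (e.g. after killing the stuck worker). */
        void handleDeadWorker(String deadServer, Map<String, Task> tasks) {
          for (Map.Entry<String, Task> e : tasks.entrySet()) {
            Task task = e.getValue();
            if (deadServer.equals(task.curWorkerName)) {
              // A forced resubmit bypasses the unforced-resubmit threshold, which would
              // explain why killing the stuck worker immediately freed the task, while the
              // threshold path above never clears curWorkerName on its own.
              resubmit(e.getKey(), task, /* force = */ true);
            }
          }
        }
      }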
      

        Attachments

        1. HBASE-22289.01-branch-2.1.patch
          4 kB
          Sergey Shelukhin
        2. HBASE-22289.02-branch-2.1.patch
          5 kB
          Sergey Shelukhin
        3. HBASE-22289.03-branch-2.1.patch
          5 kB
          Sergey Shelukhin
        4. HBASE-22289.branch-2.1.001.patch
          6 kB
          Michael Stack
        5. HBASE-22289.branch-2.1.001.patch
          6 kB
          Michael Stack
        6. HBASE-22289.branch-2.1.001.patch
          6 kB
          Michael Stack

          Activity

            People

            • Assignee:
              sershe Sergey Shelukhin
            • Reporter:
              sershe Sergey Shelukhin

              Dates

              • Created:
              • Updated:
              • Resolved:
