[HBASE-5081] Distributed log splitting deleteNode races against splitLog retry - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.92.0, 0.94.0
Fix Version/s: 0.92.0
Component/s: wal
Labels: None
Hadoop Flags: Reviewed

Description

Recently, during 0.92 RC testing, we found that distributed log splitting can hang forever. Please see the attached screenshot.
I looked into it, and here is what I think happened:

1. One region server died; ServerShutdownHandler detected this and started distributed log splitting.
2. All three tasks failed, so the three tasks were deleted asynchronously.
3. ServerShutdownHandler retried the log splitting.
4. During the retry, it created the three tasks again and put them in a hash map (tasks).
5. The asynchronous deletion from step 2 finally completed for one task; its callback removed that task from the hash map.
6. The ZooKeeper watcher for one of the newly submitted tasks found that the task was unassigned and not in the hash map, so it created a new orphan task.
7. All three tasks failed again, but the task created in step 6 was an orphan, so the batch.err counter was one short. Log splitting therefore hangs, waiting forever for a last task that will never finish.
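
The sequence above can be replayed deterministically with a small model. This is only a sketch of the race, not the HBase code: the class SplitLogRaceSketch, its methods, the installed and error fields, and the task names are all hypothetical, and a plain queue of callbacks stands in for ZooKeeper's asynchronous delete.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical model of the race; none of these names come from HBase.
class SplitLogRaceSketch {
    /** Per-attempt counters; "error" plays the role of batch.err. */
    static final class Batch {
        int installed;
        int error;
    }

    /** A task with a null batch is an orphan: its failure is counted nowhere. */
    static final class Task {
        final Batch batch;
        Task(Batch batch) { this.batch = batch; }
    }

    final Map<String, Task> tasks = new HashMap<>();
    // Stand-in for async deletes: a callback runs only when we take it off the queue.
    final Queue<Runnable> pendingDeleteCallbacks = new ArrayDeque<>();

    void submit(String path, Batch batch) {
        tasks.put(path, new Task(batch));
        batch.installed++;
    }

    void deleteNodeAsync(String path) {
        pendingDeleteCallbacks.add(() -> tasks.remove(path));
    }

    /** Watcher path: an unknown task is registered as an orphan. */
    void onNodeSeen(String path) {
        tasks.computeIfAbsent(path, p -> new Task(null));
    }

    void onTaskFailed(String path) {
        Task t = tasks.get(path);
        if (t != null && t.batch != null) {
            t.batch.error++;
        }
    }

    /** Replays steps 1-7 and returns {installed, error} for the retry batch. */
    static int[] simulate() {
        SplitLogRaceSketch mgr = new SplitLogRaceSketch();
        String[] paths = {"task-a", "task-b", "task-c"};

        // Steps 1-2: first attempt fails; deletes are queued, not applied.
        Batch first = new Batch();
        for (String p : paths) mgr.submit(p, first);
        for (String p : paths) {
            mgr.onTaskFailed(p);
            mgr.deleteNodeAsync(p);
        }

        // Steps 3-4: the retry re-creates the same three tasks.
        Batch retry = new Batch();
        for (String p : paths) mgr.submit(p, retry);

        // Step 5: one stale callback fires and removes the NEW "task-a" entry.
        mgr.pendingDeleteCallbacks.remove().run();

        // Step 6: the watcher no longer finds "task-a" in the map, so it is orphaned.
        mgr.onNodeSeen("task-a");

        // Step 7: all three fail, but the orphan's failure is not counted.
        for (String p : paths) mgr.onTaskFailed(p);

        return new int[] {retry.installed, retry.error};
    }

    public static void main(String[] args) {
        int[] r = simulate();
        System.out.println("installed=" + r[0] + " error=" + r[1]);
    }
}
```

In this model the retry batch ends with three installed tasks but only two recorded errors, so a caller that waits until the counters add up to the installed count would never return.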

So I think the problem is step 2. The fix is to make the deletion synchronous instead of asynchronous, so that the retry has a clean start.

Async deleteNode interferes with the split log retry. In an extreme case, if the async deleteNode does not complete soon enough, a node created during the retry could be deleted.
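
The extreme case can be illustrated with an even smaller model. This is a sketch under assumptions: LateDeleteSketch, its methods, and the node path are hypothetical, and a set of paths stands in for the ZooKeeper namespace.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical model: a set of paths plus a queue of not-yet-applied deletes.
class LateDeleteSketch {
    final Set<String> znodes = new HashSet<>();
    final Queue<String> pendingDeletes = new ArrayDeque<>();

    void createNode(String path) {
        znodes.add(path);
    }

    /** Records the request; nothing is removed until the queue is drained. */
    void deleteNodeAsync(String path) {
        pendingDeletes.add(path);
    }

    /** Applies every delete that was requested earlier. */
    void drainPendingDeletes() {
        String path;
        while ((path = pendingDeletes.poll()) != null) {
            znodes.remove(path);
        }
    }

    /** Returns whether the node (re)created by the retry is still present. */
    static boolean retriedNodeSurvives() {
        LateDeleteSketch zk = new LateDeleteSketch();
        String path = "/splitlog/example-task"; // illustrative path only

        zk.createNode(path);       // first attempt
        zk.deleteNodeAsync(path);  // first attempt failed; delete is queued
        zk.createNode(path);       // retry (re)creates the node before the delete lands
        zk.drainPendingDeletes();  // the stale delete finally lands

        return zk.znodes.contains(path);
    }

    public static void main(String[] args) {
        System.out.println("retried node survives: " + retriedNodeSurvives());
    }
}
```

Here the stale delete wipes out the node that the retry depends on, which is the failure mode described above.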

deleteNode should be sync.
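
A minimal sketch of why a synchronous delete gives the retry a clean start, using a simplified in-memory model. SyncDeleteSketch, deleteNodeSync, and the installed and error fields are hypothetical names for illustration; this is not the attached patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model; the actual fix is in the attached patches.
class SyncDeleteSketch {
    /** Per-attempt counters; "error" plays the role of batch.err. */
    static final class Batch {
        int installed;
        int error;
    }

    /** A task with a null batch would be an orphan. */
    static final class Task {
        final Batch batch;
        Task(Batch batch) { this.batch = batch; }
    }

    final Map<String, Task> tasks = new HashMap<>();

    void submit(String path, Batch batch) {
        tasks.put(path, new Task(batch));
        batch.installed++;
    }

    /** Synchronous delete: the entry is gone before this method returns. */
    void deleteNodeSync(String path) {
        tasks.remove(path);
    }

    /** Watcher path: only an unknown task becomes an orphan. */
    void onNodeSeen(String path) {
        tasks.computeIfAbsent(path, p -> new Task(null));
    }

    void onTaskFailed(String path) {
        Task t = tasks.get(path);
        if (t != null && t.batch != null) {
            t.batch.error++;
        }
    }

    /** Runs a failed attempt plus a failed retry; returns {installed, error} for the retry. */
    static int[] simulate() {
        SyncDeleteSketch mgr = new SyncDeleteSketch();
        String[] paths = {"task-a", "task-b", "task-c"};

        // First attempt fails; every task is deleted before the retry begins.
        Batch first = new Batch();
        for (String p : paths) mgr.submit(p, first);
        for (String p : paths) {
            mgr.onTaskFailed(p);
            mgr.deleteNodeSync(p);
        }
        if (!mgr.tasks.isEmpty()) {
            throw new IllegalStateException("retry must start from an empty task map");
        }

        // Retry: no stale callback can remove the new entries, so no orphan appears.
        Batch retry = new Batch();
        for (String p : paths) mgr.submit(p, retry);
        for (String p : paths) mgr.onNodeSeen(p);
        for (String p : paths) mgr.onTaskFailed(p);

        return new int[] {retry.installed, retry.error};
    }

    public static void main(String[] args) {
        int[] r = simulate();
        System.out.println("installed=" + r[0] + " error=" + r[1]);
    }
}
```

With the delete applied before the retry starts, every failure in the retry is attributed to the retry's batch, so the error count reaches the installed count and the waiter can finish.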

Attachments

File | Date | Size | Uploaded by
0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch | 05/Jan/12 22:36 | 38 kB | Prakash Khemani
0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch | 04/Jan/12 21:37 | 34 kB | Prakash Khemani
0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch | 04/Jan/12 18:49 | 33 kB | Prakash Khemani
0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch | 04/Jan/12 01:15 | 32 kB | Prakash Khemani
0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch | 03/Jan/12 23:59 | 29 kB | Prakash Khemani
0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch | 03/Jan/12 23:27 | 29 kB | Prakash Khemani
0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch | 03/Jan/12 22:07 | 29 kB | Prakash Khemani
5081-deleteNode-with-while-loop.txt | 05/Jan/12 05:27 | 31 kB | Ted Yu
distributed_log_splitting_screen_shot2.png | 06/Jan/12 03:57 | 125 kB | Jimmy Xiang
distributed_log_splitting_screenshot3.png | 06/Jan/12 15:49 | 286 kB | Jimmy Xiang
distributed-log-splitting-screenshot.png | 21/Dec/11 03:51 | 58 kB | Jimmy Xiang
hbase-5081_patch_for_92_v4.txt | 21/Dec/11 17:41 | 2 kB | Jimmy Xiang
hbase-5081_patch_v5.txt | 21/Dec/11 18:07 | 2 kB | Jimmy Xiang
HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch | 05/Jan/12 23:21 | 37 kB | Ted Yu
hbase-5081-patch-v6.txt | 21/Dec/11 20:36 | 2 kB | Jimmy Xiang
hbase-5081-patch-v7.txt | 22/Dec/11 00:33 | 4 kB | Jimmy Xiang
patch_for_92_v2.txt | 21/Dec/11 17:02 | 2 kB | Jimmy Xiang
patch_for_92_v3.txt | 21/Dec/11 17:12 | 2 kB | Jimmy Xiang
patch_for_92.txt | 21/Dec/11 16:49 | 1 kB | Jimmy Xiang

Issue Links

is related to: HBASE-5136 Redundant MonitoredTask instances in case of distributed log splitting retry (Closed)

People

Assignee: Prakash Khemani

Reporter: Jimmy Xiang

Votes: 0

Watchers: 7

Dates

Created: 21/Dec/11 03:50

Updated: 20/Nov/15 11:55

Resolved: 06/Jan/12 19:39