[HBASE-12865] WALs may be deleted before they are replicated to peers - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.98.14, 1.0.2, 1.2.0, 1.1.2, 1.3.0, 2.0.0
Component/s: Replication
Labels: None
Hadoop Flags: Reviewed

Description

By design, ReplicationLogCleaner guarantees that WALs still in a replication queue cannot be deleted by the HMaster. The ReplicationLogCleaner gets the WAL set from ZooKeeper by scanning the replication znodes. However, because the scan is not atomic, it may get an incomplete WAL set during a replication failover.

For example, suppose there are three region servers (rs1, rs2, rs3) and one peer with id 10. The layout of the replication ZooKeeper nodes is:

/hbase/replication/rs/rs1/10/wals
                     /rs2/10/wals
                     /rs3/10/wals

t1: the ReplicationLogCleaner finishes scanning the replication queue of rs1 and starts to scan the queue of rs2.
t2: region server rs3 goes down, and rs1 takes over rs3's replication queue. The new layout is:

/hbase/replication/rs/rs1/10/wals
                     /rs1/10-rs3/wals
                     /rs2/10/wals
                     /rs3

t3: the ReplicationLogCleaner finishes scanning the queue of rs2 and starts to scan the node of rs3. But the queue has already been moved to /hbase/replication/rs/rs1/10-rs3/wals.

So the ReplicationLogCleaner misses the WALs of rs3 for peer 10, and the HMaster may delete these WALs before they are replicated to peer clusters.
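
The timeline above can be reproduced with a small in-memory model. This is only a sketch under stated assumptions: the dictionary stands in for the replication znodes, and `make_tree`, `failover`, `scan`, and the WAL names are hypothetical helpers made up for illustration, not HBase or ZooKeeper APIs.

```python
# Toy in-memory model of the replication znodes; not HBase or ZooKeeper code.
# Maps region server -> {queue id -> set of WAL names}. All names are made up.
def make_tree():
    return {
        "rs1": {"10": {"wal-rs1-a"}},
        "rs2": {"10": {"wal-rs2-a"}},
        "rs3": {"10": {"wal-rs3-a", "wal-rs3-b"}},
    }


def failover(tree, dead, survivor):
    """Move every queue of `dead` under `survivor`, renamed '<queue>-<dead>'."""
    for queue_id, wals in tree[dead].items():
        tree[survivor][f"{queue_id}-{dead}"] = wals
    tree[dead] = {}


def scan(tree, on_visit=None):
    """Non-atomic scan: read one region server at a time, in listing order."""
    seen = set()
    for rs in sorted(tree):
        for wals in tree[rs].values():
            seen |= wals
        if on_visit:
            on_visit(rs)  # hook used to inject a concurrent change (t2)
    return seen


all_wals = scan(make_tree())  # what an undisturbed scan would report
tree = make_tree()


def rs3_dies_after_rs1(rs):
    # t2: rs3 goes down right after the cleaner has finished scanning rs1.
    if rs == "rs1":
        failover(tree, dead="rs3", survivor="rs1")


seen = scan(tree, on_visit=rs3_dies_after_rs1)
missed = all_wals - seen
print(sorted(missed))  # prints ['wal-rs3-a', 'wal-rs3-b']
```

In this model the WALs of rs3 are still present under rs1's `10-rs3` queue, but the scan never sees them, because rs1 was read before the move and rs3 after it.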

We encountered this problem in our cluster, and I think it is a serious bug for replication.

Suggestions for fixing this bug are welcome. Thanks!
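
One possible direction, sketched here as an assumption rather than as what the attached patches do, is to repeat the scan until a change counter on the queue layout is the same before and after it. ZooKeeper's `Stat` does carry a `cversion` field that counts changes to a znode's children, which might serve as such a counter, but whether it covers every failover step is not established here. `QueueTree`, `scan_once`, and `consistent_scan` are hypothetical names in a toy in-memory model.

```python
class QueueTree:
    """Toy stand-in for the replication znodes, with a change counter."""

    def __init__(self):
        self.queues = {
            "rs1": {"10": {"wal-rs1-a"}},
            "rs2": {"10": {"wal-rs2-a"}},
            "rs3": {"10": {"wal-rs3-a"}},
        }
        self.version = 0  # hypothetical counter, bumped on every layout change

    def failover(self, dead, survivor):
        for queue_id, wals in self.queues[dead].items():
            self.queues[survivor][f"{queue_id}-{dead}"] = wals
        self.queues[dead] = {}
        self.version += 1


def scan_once(tree, on_visit=None):
    """Non-atomic scan, one region server at a time."""
    seen = set()
    for rs in sorted(tree.queues):
        for wals in list(tree.queues[rs].values()):
            seen |= wals
        if on_visit:
            on_visit(rs)  # hook used to inject a concurrent failover
    return seen


def consistent_scan(tree, on_visit=None, max_attempts=5):
    """Repeat the scan until the version is unchanged across one full pass."""
    for _ in range(max_attempts):
        before = tree.version
        seen = scan_once(tree, on_visit)
        if tree.version == before:
            return seen
    raise RuntimeError("queue layout kept changing; no WAL set reported")


tree = QueueTree()
fired = []


def rs3_dies_once(rs):
    # Inject the failover after rs1 has been scanned, on the first pass only.
    if rs == "rs1" and not fired:
        fired.append(True)
        tree.failover(dead="rs3", survivor="rs1")


seen = consistent_scan(tree, on_visit=rs3_dies_once)
print(sorted(seen))  # prints ['wal-rs1-a', 'wal-rs2-a', 'wal-rs3-a']
```

The first pass is discarded because the counter moved during it; the second pass sees the `10-rs3` queue under rs1. Raising after `max_attempts` errs on the side of not reporting a WAL set at all, so nothing would be deleted based on a partial view.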

Attachments

HBASE-12865-V1.diff, 30/Jun/15 11:20, 13 kB, He Liangliang
HBASE-12865-V2.diff, 01/Jul/15 11:21, 13 kB, He Liangliang

People

Assignee: He Liangliang
Reporter: Shaohui Liu
Votes: 0
Watchers: 13

Dates

Created: 15/Jan/15 04:21
Updated: 31/Aug/15 22:39
Resolved: 07/Aug/15 22:14