HBase / HBASE-7634

Replication handling of changes to peer clusters is inefficient


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.95.2
    • Fix Version/s: 0.98.0, 0.95.2
    • Component/s: Replication
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      This change has an impact on the number of watches set on the ${zookeeper.znode.parent}/rs node in ZK in a replication slave cluster (i.e. a cluster that is being replicated to). Every region server in each master cluster will place a watch on the rs node of each slave cluster. No additional configuration is necessary for this, but this could potentially have an impact on the performance and/or hardware requirements of ZK on very large clusters.
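      A minimal sketch of the kind of child watch described in this note, using the plain ZooKeeper client API. The znode path, class name, and re-registration logic here are assumptions for illustration only, not the actual HBase replication code.

      import java.util.List;
      import org.apache.zookeeper.WatchedEvent;
      import org.apache.zookeeper.Watcher;
      import org.apache.zookeeper.ZooKeeper;

      // Illustrative only: re-reads the slave cluster's region server list whenever
      // the children of its rs znode change. Path and class name are assumptions,
      // not the actual HBase replication code.
      public class PeerRegionServerWatcher implements Watcher {

        private final ZooKeeper zk;
        private final String rsZNode; // e.g. "/hbase/rs" on the slave cluster

        public PeerRegionServerWatcher(ZooKeeper zk, String rsZNode) {
          this.zk = zk;
          this.rsZNode = rsZNode;
        }

        /** Reads the current region servers and leaves a watch for the next change. */
        public List<String> fetchAndWatch() throws Exception {
          // Passing "this" as the watcher re-arms the watch on every read.
          return zk.getChildren(rsZNode, this);
        }

        @Override
        public void process(WatchedEvent event) {
          if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
              List<String> servers = fetchAndWatch();
              System.out.println("Peer region servers changed: " + servers);
            } catch (Exception e) {
              // A real implementation would retry or surface this to the replication source.
              e.printStackTrace();
            }
          }
        }
      }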

    Description

      The current handling of changes to the region servers in a replication peer cluster is quite inefficient. The list of region servers that are being replicated to is only updated if a large number of issues are encountered while replicating.

      This can cause it to take quite a while to recognize that a number of the regionservers in a peer cluster are no longer available. A potentially bigger problem is that if a replication peer cluster is started with a small number of regionservers, and more region servers are added after replication has started, the additional region servers will never be used for replication (unless there are failures on the in-use regionservers).

      Part of the issue is that the retry code in ReplicationSource#shipEdits checks a randomly-chosen replication peer regionserver (in ReplicationSource#isSlaveDown) to see if it is up after a replication write has failed on a different randomly-chosen replication peer. If the checked peer is not seen as down, another randomly-chosen peer is used for writing.

      A second part of the issue is that changes to the list of region servers in a peer cluster are not actively detected; they are only picked up after a certain number of failures have occurred while trying to ship edits.
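      A compact, self-contained sketch of the behaviour described in the two paragraphs above. The class and method names, and the threshold value, are illustrative stand-ins rather than the actual ReplicationSource code: edits are shipped to a randomly chosen sink, a failure triggers an isSlaveDown probe against another random peer, and the sink list is only re-read once enough failures have accumulated.

      import java.io.IOException;
      import java.util.List;
      import java.util.Random;

      // Sketch of the flow described in this issue; all names are illustrative.
      public class ShipEditsSketch {
        private static final int REFRESH_THRESHOLD = 10;   // illustrative value

        private final Random random = new Random();
        private List<String> peerRegionServers;
        private int consecutiveFailures = 0;

        public ShipEditsSketch(List<String> initialPeers) {
          this.peerRegionServers = initialPeers;
        }

        public void shipEdits(byte[] batch) {
          while (true) {
            String sink = pickRandom(peerRegionServers);
            try {
              replicateTo(sink, batch);              // write to a randomly chosen peer RS
              consecutiveFailures = 0;
              return;
            } catch (IOException e) {
              consecutiveFailures++;
              // The peer list is only refreshed after repeated failures, so region
              // servers added to the peer cluster are otherwise never picked up.
              if (consecutiveFailures >= REFRESH_THRESHOLD) {
                peerRegionServers = refreshPeerList();
                consecutiveFailures = 0;
                continue;
              }
              // Probe a different randomly chosen peer; if it answers, the slave
              // cluster is treated as healthy and another random sink is tried.
              String probe = pickRandom(peerRegionServers);
              if (isSlaveDown(probe)) {
                sleepQuietly(1000);
              }
            }
          }
        }

        private String pickRandom(List<String> servers) {
          return servers.get(random.nextInt(servers.size()));
        }

        // Stubs standing in for the real RPCs and ZK reads.
        private void replicateTo(String sink, byte[] batch) throws IOException { }
        private boolean isSlaveDown(String server) { return false; }
        private List<String> refreshPeerList() { return peerRegionServers; }
        private void sleepQuietly(long ms) {
          try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
        }
      }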

      Attachments

        1. HBASE-7634.v6.patch
          35 kB
          Gabriel Reid
        2. HBASE-7634.v5.patch
          35 kB
          Gabriel Reid
        3. HBASE-7634.v4.patch
          33 kB
          Gabriel Reid
        4. HBASE-7634.v3.patch
          32 kB
          Gabriel Reid
        5. HBASE-7634.v2.patch
          32 kB
          Gabriel Reid
        6. HBASE-7634.patch
          32 kB
          Gabriel Reid


            People

              Assignee: Gabriel Reid
              Reporter: Gabriel Reid
              Votes: 0
              Watchers: 11

              Dates

                Created:
                Updated:
                Resolved: