[HBASE-9047] Tool to handle finishing replication when the cluster is offline - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.96.0
Fix Version/s: 0.98.0, 0.94.15, 0.96.2, 0.99.0
Component/s: None
Labels: None
Hadoop Flags: Reviewed

Description

We're having a discussion on the mailing list about replicating the remaining data from a cluster that has been shut down. The motivation could be that you don't want to bring HBase back up but still need that data on the slave.

So I have this idea of a tool that would run on the master cluster while it is down, although it could also run at any time. Basically, it would read the replication state of each master region server, finish replicating what's missing to all the slaves, and then clear that state in ZooKeeper.

The code that handles replication does most of that already; see ReplicationSourceManager and ReplicationSource. Basically, when ReplicationSourceManager.init() is called, it checks all the queues in ZK and tries to grab those that aren't attached to a region server. If the whole cluster is down, it will grab all of them.
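The queue-claiming behaviour described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class `OrphanQueueClaimSketch`, the method `claimOrphanedQueues`, and the map-based representation of the queues are all hypothetical and are not HBase's ReplicationSourceManager API or its actual ZooKeeper layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory model of "grab the queues that have no live owner".
// It does not use or mirror any real HBase or ZooKeeper API.
final class OrphanQueueClaimSketch {

    /**
     * Moves every queue whose owning region server is not in liveServers
     * over to the claimer. queuesByServer maps a region server name to the
     * list of log files still waiting to be replicated.
     * Returns the names of the servers whose queues were claimed.
     */
    static List<String> claimOrphanedQueues(Map<String, List<String>> queuesByServer,
                                            Set<String> liveServers,
                                            String claimer) {
        List<String> claimedFrom = new ArrayList<>();
        List<String> adopted = new ArrayList<>();
        // Iterate over a snapshot of the keys so entries can be removed safely.
        for (String owner : new ArrayList<>(queuesByServer.keySet())) {
            if (owner.equals(claimer) || liveServers.contains(owner)) {
                continue; // queue is still attached to a running server
            }
            adopted.addAll(queuesByServer.remove(owner));
            claimedFrom.add(owner);
        }
        if (!adopted.isEmpty()) {
            queuesByServer.computeIfAbsent(claimer, k -> new ArrayList<>()).addAll(adopted);
        }
        return claimedFrom;
    }

    public static void main(String[] args) {
        Map<String, List<String>> queues = new TreeMap<>();
        queues.put("rs1", new ArrayList<>(Arrays.asList("wal-1", "wal-2")));
        queues.put("rs2", new ArrayList<>(Arrays.asList("wal-3")));
        // Whole cluster is down: no live servers, so every queue is claimed.
        List<String> claimed =
            claimOrphanedQueues(queues, Collections.<String>emptySet(), "sync-tool");
        System.out.println("claimed from " + claimed + ", queues now " + queues);
    }
}
```

With an empty set of live servers the claimer ends up owning every pending log, which corresponds to the "whole cluster is down" case; passing a non-empty set leaves the queues of running servers untouched.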

The beautiful thing here is that you could start the tool on all your machines and the load would be spread out. That might not be a big concern if replication wasn't lagging, since it would then take only a few seconds to finish replicating the missing data for each region server.

I'm guessing when starting ReplicationSourceManager you'd give it a fake region server ID, and you'd tell it not to start its own source.

FWIW, the main difference in how replication is handled between Apache's HBase and Facebook's is that in the latter it is always done separately from HBase itself. This JIRA isn't about doing that.

Attachments

- HBASE-9047-0.94.9-v0.PATCH, 12/Sep/13 23:10, 23 kB, Demai Ni
- HBASE-9047-0.94-v1.patch, 13/Dec/13 04:14, 21 kB, Demai Ni
- HBASE-9047-trunk-v0.patch, 12/Sep/13 20:29, 23 kB, Demai Ni
- HBASE-9047-trunk-v1.patch, 22/Sep/13 04:16, 22 kB, Demai Ni
- HBASE-9047-trunk-v2.patch, 26/Oct/13 22:26, 21 kB, Demai Ni
- HBASE-9047-trunk-v3.patch, 30/Oct/13 19:30, 21 kB, Demai Ni
- HBASE-9047-trunk-v4.patch, 07/Nov/13 18:39, 21 kB, Michael Stack
- HBASE-9047-trunk-v4.patch, 06/Nov/13 22:30, 21 kB, Demai Ni
- HBASE-9047-trunk-v5.patch, 06/Dec/13 02:13, 21 kB, Demai Ni
- HBASE-9047-trunk-v6.patch, 10/Dec/13 03:20, 21 kB, Demai Ni
- HBASE-9047-trunk-v7.patch, 12/Dec/13 19:44, 21 kB, Demai Ni
- HBASE-9047-trunk-v7.patch, 10/Dec/13 18:45, 21 kB, Demai Ni

Issue Links

relates to: HBASE-10249 TestReplicationSyncUpTool fails because failover takes too long (Closed)

People

Assignee: Demai Ni
Reporter: Jean-Daniel Cryans
Votes: 0
Watchers: 13

Dates

Created: 26/Jul/13 20:08
Updated: 05/Jan/16 09:27
Resolved: 17/Dec/13 05:51