Hadoop HDFS / HDFS-9075

Multiple datacenter replication inside one HDFS cluster


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: datanode, namenode

    Description

      Deploying multiple datacenters for scaling and disaster tolerance is a common scenario.
      In this case we certainly want data to be shared transparently (to the user) across datacenters.

      For example, say we have a raw user-action log stored daily, and different computations take the log as input. As scale grows, we may want to schedule various kinds of computations in more than one datacenter.

      As far as I know, the current solution is to deploy multiple independent clusters, one per datacenter, and use distcp to sync data files between them.
      But in this case the user needs to know exactly where each dataset is stored, and mistakes may be made during human-driven operations. After all, it is basically a job for a computer.

      Based on these facts, a multiple-datacenter replication solution may address this scenario.

      I am working on a prototype that supports 2 datacenters. The goal is to replicate data between datacenters transparently while minimizing inter-DC bandwidth usage. The basic idea is to replicate blocks to both DCs, and to determine the number of replicas in each DC from historical statistics on access behavior for that part of the namespace.
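      The per-DC replica heuristic described above could be sketched as follows. This is a minimal illustration, not the prototype's actual code: the class and method names (CrossDcReplicaPolicy, recordRead, replicasFor) are hypothetical, and it assumes every DC keeps at least one replica for availability while the remaining replicas follow the historical read ratio.

      ```java
      import java.util.HashMap;
      import java.util.Map;

      /**
       * Hypothetical sketch (not part of HDFS): decide how many replicas of a
       * namespace subtree to place in each datacenter, based on historical
       * per-DC read counts.
       */
      public class CrossDcReplicaPolicy {
          private final int totalReplicas;                 // e.g. dfs.replication = 3
          private final Map<String, Long> readsByDc = new HashMap<>();

          public CrossDcReplicaPolicy(int totalReplicas) {
              this.totalReplicas = totalReplicas;
          }

          /** Record one historical read of this subtree from the given DC. */
          public void recordRead(String dc) {
              readsByDc.merge(dc, 1L, Long::sum);
          }

          /**
           * Replicas to place in the given DC: one per DC for availability,
           * plus a share of the remaining replicas proportional to how often
           * that DC reads the data.
           */
          public int replicasFor(String dc, int dcCount) {
              int extra = Math.max(0, totalReplicas - dcCount);
              long total = readsByDc.values().stream().mapToLong(Long::longValue).sum();
              long local = readsByDc.getOrDefault(dc, 0L);
              int bonus = (total == 0) ? 0 : (int) Math.round((double) extra * local / total);
              return 1 + Math.min(extra, bonus);
          }
      }
      ```

      With a replication factor of 3 and 2 DCs, a subtree read mostly from one DC would keep 2 replicas there and 1 in the other, so most reads stay local and only the minimum crosses the inter-DC link.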

      I will post a design document soon.

People

    Assignee: He Tianyi
    Reporter: He Tianyi
    Votes: 2
    Watchers: 23
