Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      HBase should consider supporting a federated deployment where someone might have terascale (or beyond) clusters in more than one geography and would want the system to handle replication between the clusters/regions. It would be sweet if HBase had something on the roadmap to sync between replicas out of the box.

      Consider if rows, columns, or even cells could be scoped: local, or global.

      Then, consider a background task on each cluster that replicates new globally scoped edits to peer clusters. The HBase/Bigtable data model has convenient features (timestamps, multiversioning) such that simple exchange of globally scoped cells would be conflict free and would "just work". Implementation effort here would be in producing an efficient mechanism for collecting up edits from all the HRS and transmitting the edits over the network to peers where they would then be split out to the HRS there. Holding on to the edit trace and tracking it until the remote commits succeed would also be necessary. So, HLog is probably the right place to set up the tee. This would be filtered log shipping, basically.
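       As a minimal sketch of this filtered log shipping, under the assumption that edits carry a scope: the Edit, Scope, and PeerClient types below are illustrative stand-ins, not existing HBase classes.

import java.util.ArrayList;
import java.util.List;

enum Scope { LOCAL, GLOBAL }

class Edit {
    final byte[] row, family, qualifier, value;
    final long timestamp;
    final Scope scope;

    Edit(byte[] row, byte[] family, byte[] qualifier, long timestamp,
         byte[] value, Scope scope) {
        this.row = row; this.family = family; this.qualifier = qualifier;
        this.timestamp = timestamp; this.value = value; this.scope = scope;
    }
}

interface PeerClient {
    /** Replays the batch on the remote cluster as if it were local writes. */
    void replay(List<Edit> batch);
}

class LogShipper {
    private final List<PeerClient> peers;

    LogShipper(List<PeerClient> peers) { this.peers = peers; }

    /** Called after a log roll: filter the rolled log and forward it to peers. */
    void shipRolledLog(Iterable<Edit> rolledLog) {
        List<Edit> batch = new ArrayList<Edit>();
        for (Edit e : rolledLog) {
            if (e.scope == Scope.GLOBAL) {
                batch.add(e);          // only globally scoped cells leave the cluster
            }
        }
        if (batch.isEmpty()) {
            return;
        }
        for (PeerClient peer : peers) {
            peer.replay(batch);        // retry/ack handling until remote commit omitted
        }
    }
}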

       This proposal does not consider transactional tables. For transactional tables, enforcement of global mutation commit ordering would come into the picture if the user wants a transaction to span the federation. Because of how slow it would be, this should be an optional feature, even though transactional tables are themselves optional.

      Attachments

      1. hbase_repl.3.pdf (246 kB, Andrew Purtell)
      2. hbase_repl.3.odp (151 kB, Andrew Purtell)


          Activity

          Jean-Daniel Cryans added a comment -

           Glad to see that you have already done some work on this issue, Andrew; once 0.20 is out I'm sure people will be looking for a feature like this. Regarding the slides:

          • I like the idea of having nominated gateways.
          • Using the WALs seems the way to go, especially the way you describe how it should be done.
           • In the case a cluster is down, I guess that would mean the other clusters would have to keep all the WALs until it is up again. At that moment, it may receive tons of WALs, right?

           Also I was wondering: if you want to add a new cluster, would the way to go be to replicate all the data by hand (MR or otherwise) to the other cluster and then somehow tell the clusters they have a new peer?

           Andrew Purtell added a comment - edited

           In the case a cluster is down, I guess that would mean the other clusters would have to keep all the WALs until it is up again. At that moment, it may receive tons of WALs, right?

           Yes, the effect of a partition and extended outage is a buildup of WALs on the peer clusters, and then a lot of backlog. Let me think about this case and post a revised slide deck.

           Also I was wondering: if you want to add a new cluster, would the way to go be to replicate all the data by hand (MR or otherwise) to the other cluster and then somehow tell the clusters they have a new peer?

           I was anticipating that the cluster would be advertised as a peer, somehow, and then replication would start. The replicators should add tables and column families to their local schema on demand as the cells are received, and maybe also ask the peer about schema details as necessary. Updates to HTD and HCD attributes can be considered another type of edit. Whether or not to bring over existing data would be a deployment/application concern, I think, and could be handled by an MR export-import job.
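           A rough sketch of that on-demand schema idea against the old HBaseAdmin client API (exact signatures from that era are approximate, and the table/family arguments would come from the received cells):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

class SchemaOnDemand {
    private final HBaseAdmin admin;

    SchemaOnDemand(HBaseAdmin admin) { this.admin = admin; }

    /**
     * Before applying a replicated cell, make sure its table and family exist
     * locally. HTD/HCD attribute updates would arrive as another edit type.
     * (In this era's API, adding a family may require disabling the table
     * first; that step is omitted here.)
     */
    void ensureSchema(String tableName, byte[] family) throws Exception {
        if (!admin.tableExists(tableName)) {
            HTableDescriptor htd = new HTableDescriptor(tableName);
            htd.addFamily(new HColumnDescriptor(family));
            admin.createTable(htd);
        } else if (!admin.getTableDescriptor(tableName.getBytes()).hasFamily(family)) {
            admin.addColumn(tableName, new HColumnDescriptor(family));
        }
    }
}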

          ryan rawson added a comment -

          We could use ZooKeeper paths as a way for replication endpoints to know about each other.
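           For illustration only, a minimal sketch of that idea using the plain ZooKeeper client; the znode layout (/hbase/replication/peers) and class name are assumptions, not an agreed design:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class PeerRegistry {
    // Assumes the parent path already exists.
    private static final String PEERS = "/hbase/replication/peers";
    private final ZooKeeper zk;

    PeerRegistry(ZooKeeper zk) { this.zk = zk; }

    /** A cluster advertises itself; the data could be its gateway address list. */
    void advertise(String clusterId, byte[] gatewayAddresses)
            throws KeeperException, InterruptedException {
        zk.create(PEERS + "/" + clusterId, gatewayAddresses,
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    /** Other clusters list (and watch) the path to discover new peers. */
    List<String> listPeers() throws KeeperException, InterruptedException {
        return zk.getChildren(PEERS, true);  // true: watch for added/removed peers
    }
}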

          Jonathan Gray added a comment -

          This looks great Andrew! Some comments...

           • What do we do when there are two identical keys in a KeyValue (row, family, column, timestamp) but different values? That's actually going to be possible in 0.20, since you can manually set the stamp, and it will certainly be possible with multi-master replication. I'm not sure how it's handled now. It would depend on logic in both memcache insertion and, more importantly, compaction, and then on how it's handled when reading.
          • Everything is now just a KeyValue, so that would be what we send to replicas.
           • Thoughts on network partitioning? I'm assuming you're referring to partitioning of replica clusters from one another, not within a cluster, right? If so, I guess you'd hang on to WALs as long as you could; eventually a replicated cluster would go into some secondary mode of needing a full sync (when the other cluster(s) could no longer hold all the WALs, or should we assume HDFS will not fill and just flush, so we can always resync with WALs?). (Note: handling of intra-cluster partitions is virtually impossible because of our strong consistency.)
           • Regarding SCOPE and setting things as local or replicated: what do you suspect the precision/resolution of this would be? Could I have some tables being replicated to some clusters, other tables to others, and some to both? (See the sketch after this list.)
           • Would replicas of tables always require identical family settings? For example, say I have a cluster of 5 nodes with lots of memory, and I want to replicate just a single high-volume, high-read table from my primary large cluster. But in the small cluster I want to set a TTL of 1 day and also mark it as in-memory. This is kind of advanced and special, but the ability to do things like that would be very cool; we could definitely see ourselves doing something like it were it possible.
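           A minimal sketch of how per-family scoping could be expressed at the schema level, using HColumnDescriptor.setScope and the HConstants replication-scope constants; treat this as an illustration of the granularity question rather than a settled design (table and family names are examples):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HTableDescriptor;

public class ScopedSchemaExample {
    public static void main(String[] args) {
        HTableDescriptor table = new HTableDescriptor("events");

        // Cells in this family are shipped to peer clusters.
        HColumnDescriptor replicated = new HColumnDescriptor("public");
        replicated.setScope(HConstants.REPLICATION_SCOPE_GLOBAL);

        // Cells in this family never leave the local cluster.
        HColumnDescriptor local = new HColumnDescriptor("scratch");
        local.setScope(HConstants.REPLICATION_SCOPE_LOCAL);

        table.addFamily(replicated);
        table.addFamily(local);
    }
}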

          I've got a good bit of experience with database replication, did some work in the postgres world on WAL shipping. Let me know how I can help your effort.

          I agree on your assessment regarding consistency, etc. It is clear we should be doing an eventual consistency model for replication. This is one of my favorite topics!

          One thing that's a bit special is this would make an HBase cluster of clusters a "read-your-writes"-style eventual consistency distribution model (with our strong consistency, partitioned distribution within each individual cluster a la read-your-writes). That makes a huge difference for us, internally, on many of our data systems. This may be obvious as we're just talking about replication here, but something to keep in mind.

          Billy Pearson added a comment -

           I was thinking on this; there are some other things to consider, like table splits. Will the regions be the same on both clusters? There is no guarantee the compactions will happen at the same time or that a split will find the same mid key.

           I would think the master would be the ideal process to pull logs and pass them to the peer master; it could then split the logs into regions and pass the edits on to the servers hosting those regions.
           I would like to see sequential processing of the edits on the peer so everything stays in the same order, which is the way we store the WALs now.

           I am not sure what the current status of appends on HDFS is right now, but if we had that 100% working, the master could just remember where in the WAL it had read up to and poll every x seconds to see if there are any updates; then we would not have to worry about waiting for a log to roll, which could be a while in some cases. Waiting for a log to roll before the updates get pushed to the peers seems like the wrong way to go, but it might be the only way we have now if append is not working right in HDFS.

           As for a first sync of the peers, it would be a huge saving if we could do a rolling read-only mode on the regions: flush the memcache, copy the needed files, unlock the region, and start the transfer to the peer. This would allow a one-by-one copy of the regions to the remote cluster, with only the site-to-site bandwidth as the bottleneck. In the meantime the peer could hold edits, wait for all regions to be copied, and then start the replay of the logs, skipping any edit that is older than the timestamp of the copy. I think that timestamp could be written into the HFile now as metadata.

          Just some suggestions and/or other thoughts

          Andrew Purtell added a comment -

          @Billy:

          I don't follow what you are saying about splits. It won't matter where a table is split. The replication process does not care about such details. It would send edits from the WALs to peers to be applied as if some local HRS is receiving local batchupdates.

          Also, according to the proposal, the master would not be involved in replication. The proposal considers more than one HRS – self-elected via ZK – working in a fault tolerant way to forward edits sent by all the other HRS on to the peer cluster. There is no SPOF in the replication process.

          Also, I disagree that waiting for HLog roll is the wrong way to go. There is no reason a log roll cannot happen once per minute or every five minutes or whatever the configured replication period is. Then, we do not care if append or sync is properly implemented in the underlying FS. Given the state of how those issues are progressing in HDFS, we may have a working replication process before HDFS has a working append.
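           For illustration, a minimal way to shorten the roll period so that rolled logs become the replication batch unit; hbase.regionserver.logroll.period is the standard log-roll setting, and the one-minute value is only an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LogRollPeriodExample {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Roll the write-ahead log every 60 seconds instead of the default hour,
        // so peers see a new batch of edits roughly once a minute.
        conf.setLong("hbase.regionserver.logroll.period", 60 * 1000L);
        System.out.println(conf.get("hbase.regionserver.logroll.period"));
    }
}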

          Billy Pearson added a comment -

           OK, I agree now that I have reread the PDF. The only downside is that the lag will be the log roll period for the most part, but I can live with that if others can.

           Andrew Purtell added a comment -

           See https://issues.apache.org/jira/browse/HBASE-1411?focusedCommentId=12708624&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12708624
          Billy Pearson added a comment -

           I assume that to get a new peer online we would have to have some kind of read-lock with a flush, or
           some kind of catch-up mode that will export all data before timestamp x.

           You might include something in the PDF on the idea you are thinking of for deploying a new peer.

           Also, will this be write-one-place, read-anywhere replication or write-anywhere, read-anywhere replication?
           Will edits be able to be written to any site/cluster and get replicated to all the peers?

          Andrew Purtell added a comment -

           I assume that to get a new peer online we would have to have some kind of read-lock with a flush, or some kind of catch-up mode that will export all data before timestamp x.

           In my opinion, no. Edits are propagated from HLogs; existing data would therefore not be replicated. It would be an application-specific consideration, and could perhaps be accomplished by forwarding existing data at the application level via a MapReduce transfer job or a background value fetch-and-refresh strategy.
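           As a hedged sketch of that application-level catch-up, here is a plain scan-and-put copy from a local table to the same table on a peer cluster (essentially what a MapReduce transfer job would parallelize); the older HTable/KeyValue client API is used, and the table name and ZooKeeper quorum are examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class BulkCatchUp {
    public static void main(String[] args) throws Exception {
        Configuration localConf = HBaseConfiguration.create();
        Configuration peerConf = HBaseConfiguration.create();
        peerConf.set("hbase.zookeeper.quorum", "peer-zk1,peer-zk2,peer-zk3"); // example hosts

        HTable source = new HTable(localConf, "events");
        HTable target = new HTable(peerConf, "events");

        ResultScanner scanner = source.getScanner(new Scan());
        try {
            for (Result row : scanner) {
                Put put = new Put(row.getRow());
                for (KeyValue kv : row.raw()) {
                    // Preserve original timestamps so later WAL replay can
                    // safely skip anything older than the copy.
                    put.add(kv.getFamily(), kv.getQualifier(),
                            kv.getTimestamp(), kv.getValue());
                }
                target.put(put);
            }
        } finally {
            scanner.close();
            source.close();
            target.close();
        }
    }
}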

           Also, will this be write-one-place, read-anywhere replication or write-anywhere, read-anywhere replication?

           Write anywhere, read anywhere.

           Will edits be able to be written to any site/cluster and get replicated to all the peers?

          Yes.
           But I know that jgray, at least, would like some administrative control over the propagation details. This could be accomplished via local settings stored in each cluster's peer table, supplied as parameters to ADD PEER commands.

          Andrew Purtell added a comment -

          Break out replication into contrib/.

          Put hooks into core for holding onto logs after roll, notify watchers via some mechanism (e.g. ZK), and only allow log delete after all watchers acknowledge it. Everything else, move out.
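           A hypothetical sketch of those hooks, using ZooKeeper as the notification and acknowledgement mechanism; the znode layout and class name are illustrative, not the eventual implementation:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class RolledLogTracker {
    // Assumes the parent path already exists.
    private static final String LOGS = "/hbase/replication/rolled-logs";
    private final ZooKeeper zk;

    RolledLogTracker(ZooKeeper zk) { this.zk = zk; }

    /** Core announces a rolled log so watchers can claim it. */
    void announceRolledLog(String logName) throws KeeperException, InterruptedException {
        zk.create(LOGS + "/" + logName, new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    /** A replication gateway registers interest in the rolled log. */
    void claim(String logName, String gatewayId) throws KeeperException, InterruptedException {
        zk.create(LOGS + "/" + logName + "/" + gatewayId, new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    /** The gateway acknowledges once the peer has committed the shipped edits. */
    void acknowledge(String logName, String gatewayId) throws KeeperException, InterruptedException {
        zk.delete(LOGS + "/" + logName + "/" + gatewayId, -1);
    }

    /** Core may delete the log file only once every claim has been released. */
    boolean safeToDelete(String logName) throws KeeperException, InterruptedException {
        List<String> claims = zk.getChildren(LOGS + "/" + logName, false);
        return claims.isEmpty();
    }
}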

          Andrew Purtell added a comment -

           This issue is now an umbrella for ongoing work; moving it out of any specific release.

          Andrew Purtell added a comment -

           Should we close the umbrella? We've had a multi-datacenter replication feature of some capability in several major releases now.

          Jean-Daniel Cryans added a comment -

          You are right Andrew, closing this.

          Andrew Purtell added a comment -

           Thanks for all the work you put into this, Jean-Daniel Cryans.


            People

            • Assignee: Unassigned
            • Reporter: Andrew Purtell
            • Votes: 13
            • Watchers: 45
