Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.7.0
    • Component/s: None
    • Labels: None

      Description

      The use case here is where people have multiple data centers and need to replicate data between them. Accumulo can model this replication after the way that HBase currently handles replication, as detailed here (http://hbase.apache.org/replication.html).

      There will be one master cluster and multiple slave clusters. Accumulo will use the master-push model to replicate statements from the master cluster's WAL to the various slaves' WALs.
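As a rough illustration of the master-push model described above, the master ships each WAL entry it has not yet pushed to every registered slave, which replays it locally. All class and method names here are hypothetical sketches, not actual Accumulo APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the master-push replication model: the master
// cluster appends mutations to its WAL, then pushes any entries not yet
// shipped to each registered slave, which replays them into its own WAL.
// None of these names correspond to real Accumulo classes.
public class MasterPushSketch {

    // A single WAL entry (a serialized mutation, simplified to strings).
    public record WalEntry(String table, String row, String value) {}

    public static class SlaveCluster {
        public final List<WalEntry> replayedWal = new ArrayList<>();

        // In a real system this would apply the mutation via the client API.
        public void replay(WalEntry entry) {
            replayedWal.add(entry);
        }
    }

    public static class MasterCluster {
        public final List<WalEntry> wal = new ArrayList<>();
        public final List<SlaveCluster> slaves = new ArrayList<>();
        private int shippedOffset = 0; // WAL position already pushed to slaves

        public void write(WalEntry entry) {
            wal.add(entry);
        }

        // Push every WAL entry that has not yet been shipped to the slaves.
        public void pushToSlaves() {
            for (int i = shippedOffset; i < wal.size(); i++) {
                for (SlaveCluster slave : slaves) {
                    slave.replay(wal.get(i));
                }
            }
            shippedOffset = wal.size();
        }
    }
}
```

A real implementation would need per-slave offsets, retries, and durable bookkeeping of what has been shipped; the single offset here is only to show the push direction.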

        Issue Links

          Activity

          ASF subversion and git services added a comment -

          Commit 5fd07ec03059daa21758404de0c059a2dd5c395a in accumulo's branch refs/heads/ACCUMULO-378 from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=5fd07ec ]

          Merge branch 'master' into ACCUMULO-378

          ASF subversion and git services added a comment -

          Commit de7f591ab6f818e1967f0d4d0e266a803b6f086d in accumulo's branch refs/heads/ACCUMULO-378 from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=de7f591 ]

          ACCUMULO-378 Add details on replication "bookkeeping" on the master cluster.

          ASF subversion and git services added a comment -

          Commit 13561ebbb7480c18df3538c1eed04e8f218cfca2 in accumulo's branch refs/heads/ACCUMULO-378 from Josh Elser
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=13561eb ]

          ACCUMULO-378 Design document with first round of changes.

          Josh Elser made changes -
          Remote Link This issue links to "Original design doc link (Web Link)" [ 14768 ]
          Josh Elser made changes -
          Remote Link This issue links to "Design document review (Web Link)" [ 14742 ]
          Josh Elser added a comment -

          New reviewboard that I own (instead of Keith), which will let me administer it better. We can leave the other reviewboard up to continue discussion there, but please start new discussions at this link.

          Josh Elser made changes -
          Remote Link This issue links to "Active Design Doc Review (Web Link)" [ 14767 ]
          Keith Turner made changes -
          Remote Link This issue links to "Design document review (Web Link)" [ 14742 ]
          Josh Elser added a comment -

          FYI, I plan on starting to break down things into sub-tasks that can be worked on that (hopefully) are disjoint. Meanwhile, any feedback is welcome – although starting a thread on dev@a.a.o is likely better than doing it here.

          Josh Elser added a comment -

          Design document that I've been working out that outlines some implementation details.

          Josh Elser made changes -
          Remote Link This issue links to "Design Document (Web Link)" [ 14726 ]
          Josh Elser made changes -
          Fix Version/s 1.7.0 [ 12324607 ]
          Josh Elser made changes -
          Assignee Josh Elser [ elserj ]
          Josh Elser added a comment -

          Keith Turner, Sapan Shah, did you guys ever come up with any sort of design document? Looking back at the last chatter, we were still working with a localfs WAL capability, which is a bit out of date considering current Accumulo support.

          Given Ravi Mutyala's question on dev@a.a.o about this, any interest in thinking about this as a major 1.7 feature? It would give us something to think about while testing 1.6.0.

          Sapan Shah made changes -
          Assignee Sapan Shah [ sapanbshah42 ]
          Jeff Whiting added a comment -

          While thinking about replication, master-master replication should also be considered, as it can have large implications on how replication is implemented.

          Gavin made changes -
          Field Original Value New Value
          Workflow no-reopen-closed, patch-avail [ 12652377 ] patch-available, re-open possible [ 12671648 ]
          Keith Turner added a comment -

          We were discussing generating secondary indexes. This feature may be useful for that in addition to replicating to a remote cluster. So instead of replicating data to a remote cluster, replicate to another table on the local cluster with a data transformation step. For example, data is inserted in table A, then the mutations from table A get pushed to table B with a transformation step. This could also push bulk imports to table B through the transformation.

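The local-replication-with-transformation idea above could be sketched like this; the names are purely illustrative (not Accumulo APIs), with the transform swapping row and value to maintain a secondary index:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of local replication with a transformation step:
// mutations applied to table A are also run through a user-supplied
// transform and applied to table B, e.g. swapping row and value to
// maintain a secondary index. Names are illustrative, not Accumulo APIs.
public class TransformReplicationSketch {

    public record Mutation(String row, String value) {}

    // Replay table A's mutations into table B, transforming each one.
    public static List<Mutation> replicateWithTransform(
            List<Mutation> tableAMutations,
            Function<Mutation, Mutation> transform) {
        List<Mutation> tableBMutations = new ArrayList<>();
        for (Mutation m : tableAMutations) {
            tableBMutations.add(transform.apply(m));
        }
        return tableBMutations;
    }
}
```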
          Keith Turner added a comment -

          Sapan and I were discussing this issue. We were considering the use case where a user wants to filter some data in a table. To do this they may add a filter, force a compaction, and then remove the filter. It would be nice to have this action replicate to the backup cluster. This may be easier if the action were more atomic; see ACCUMULO-420.

          Keith Turner added a comment -

          Replicating all of ZooKeeper would not work well; we would not want to replicate info related to the root tablet location, tablet servers, loggers, and FATE operations from the master cluster. ZOOKEEPER-892 mentions the ability to replicate a sub-tree.

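The sub-tree idea could be sketched as copying only znodes under selected prefixes, leaving cluster-local state behind. The paths and method names here are hypothetical examples, not actual Accumulo ZooKeeper layout:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of sub-tree replication for ZooKeeper state:
// mirror only selected paths (e.g. user/table config) and skip
// cluster-local state such as the root tablet location, tablet
// servers, loggers, and FATE operations. Paths are illustrative only.
public class SubTreeReplicationSketch {

    // Copy only znodes under the given prefix into the slave's view.
    public static Map<String, String> replicateSubTree(
            Map<String, String> masterZnodes, String prefix) {
        Map<String, String> slaveZnodes = new TreeMap<>();
        for (Map.Entry<String, String> e : masterZnodes.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                slaveZnodes.put(e.getKey(), e.getValue());
            }
        }
        return slaveZnodes;
    }
}
```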
          Keith Turner added a comment -

          Replicating table configuration would be useful. For example, if a user enables an age-off iterator on the master cluster for major compactions, it would be nice to have that run on the slave cluster and throw old data away. We would want the same iterators configured for the master and slave tables, along with compression, locality groups, etc. I wonder if we could leverage ZOOKEEPER-892.

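The age-off rule itself is simple; Accumulo ships an AgeOffFilter iterator for it, and this standalone sketch (illustrative names only) just shows the filtering behavior that would need identical configuration on both clusters:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the age-off rule: during a major compaction,
// entries older than a configured TTL are dropped. This standalone
// version only illustrates the filtering rule; real deployments would
// configure Accumulo's AgeOffFilter iterator on the table instead.
public class AgeOffSketch {

    public record Entry(String row, long timestampMillis) {}

    // Keep only entries whose age is within the TTL.
    public static List<Entry> compactWithAgeOff(
            List<Entry> entries, long nowMillis, long ttlMillis) {
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            if (nowMillis - e.timestampMillis() <= ttlMillis) {
                kept.add(e);
            }
        }
        return kept;
    }
}
```

If the slave's table configuration drifted from the master's, a compaction on the slave would retain (or drop) different data, which is why Keith's point about replicating configuration matters.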
          Sapan Shah added a comment -

          John: I am currently adapting the WAL to append to a cloned copy in HDFS while still being performant.

          Keith:

          I think collaborating would be a great idea. I'll work on getting a design document together. I will be at the meetup, so we can discuss the various tasks to work on for this there. I see there being quite a bit.

          For the questions you asked:
          1) To begin with, I was thinking about doing just select tables, so that you did not have complete replicas, and then maybe working on a way to do total replicas.
          2) I am still working out a good way to have ZooKeeper send the updates for the user information. I am not sure about the table metadata yet; if all we are doing is calling the client API, I think that might be taken care of, since the slave table will maintain its own metadata.
          3) What you described with cloning the table, copying the data, and replicating the logs was my current plan.
          4) I have not looked into FATE that much, but will check it out.
          5) I am not sure about replicating the splits unless the user defined the splits beforehand.

          Let me check into FATE; from my skimming, it seems really useful for this.

          Keith Turner added a comment -

          I would like to collaborate w/ you on this. It seems like a starting point might be a design doc. Would you mind putting together a design doc detailing your thoughts on this? Any other suggestions on how we could collaborate? We could also meet at the meetup (http://www.meetup.com/Accumulo-Users-DC/events/45491582/) if you are in this area.

          jv added a comment -

          I need a bit of clarification: are you adapting the WAL to log to HDFS via appends, or are you working on a mechanism to shove the logs into HDFS once they are complete?

          Keith Turner added a comment -

          This sounds really cool. I looked at the HBase doc; it seems like it replays the walogs on the slave cluster through the client API.

          1) Were you thinking of doing this for all tables, or just select tables?
          2) What are your thoughts on replicating user and table metadata in ZooKeeper?
          3) What are your thoughts on enabling replication for existing data? (We could clone the table, copy its existing data, and replicate new walogs created after the clone operation.)
          4) How are you thinking of handling bulk imported data? (We could possibly copy to the slave and bulk import there also; this could be a FATE operation initiated by the bulk import FATE operation.)
          5) What are your thoughts on replicating split and merge operations on the master cluster?

          I am wondering how much we can leverage FATE to make this easier and more reliable.

          Sapan Shah added a comment -

          I have started some basic work on this, such as working on trying to get the WAL working on HDFS.

          Sapan Shah created issue -

            People

            • Assignee: Josh Elser
            • Reporter: Sapan Shah
            • Votes: 0
            • Watchers: 4
