Index: src/main/java/org/apache/hadoop/hbase/replication/package.html
===================================================================
--- src/main/java/org/apache/hadoop/hbase/replication/package.html	(revision 963949)
+++ src/main/java/org/apache/hadoop/hbase/replication/package.html	(working copy)
@@ -1,128 +0,0 @@

Multi Cluster Replication

-This package provides replication between HBase clusters. -

- -

Table Of Contents

-
-  1. Status
-  2. Requirements
-  3. Deployment
-

- -

Status

- -

-This package is alpha quality software and is only meant to be a base -for future developments. The current implementation offers the following -features: - -

-  1. Master/Slave replication limited to 1 slave cluster.
-  2. Replication of scoped families in user tables.
-  3. Start/stop of the replication stream.
-  4. Supports clusters of different sizes.
-  5. Handling of network partitions longer than 10 minutes.
-Please report bugs on the project's Jira when found. -

- -

Requirements

- -

- -Before trying out replication, make sure to review the following requirements: - -

-  1. ZooKeeper should be handled by yourself, not by HBase, and should
-     always be available during the deployment.
-  2. All machines from both clusters should be able to reach every
-     other machine, since replication goes from any region server to any
-     other one on the slave cluster. That also includes the
-     ZooKeeper clusters.
-  3. Both clusters should have the same HBase and Hadoop major revision.
-     For example, having 0.21.1 on the master and 0.21.0 on the slave is
-     correct, but not 0.21.1 and 0.22.0.
-  4. Every table that contains families that are scoped for replication
-     should exist on every cluster with the exact same name; same for those
-     replicated families.
- -

- -

Deployment

- -

-
-The following steps describe how to enable replication from one cluster
-to another. This must be done with both clusters offline.
-

-  1. Edit ${HBASE_HOME}/conf/hbase-site.xml on both clusters to add
-     the following configuration:
-       <property>
-         <name>hbase.replication.enabled</name>
-         <value>true</value>
-       </property>
-  2. Run the following command on any cluster:
-       $HBASE_HOME/bin/hbase org.jruby.Main $HBASE_HOME/replication/bin/add_peer.tb
-     This will show you the help to set up the replication stream between
-     both clusters. If both clusters use the same ZooKeeper cluster, you have
-     to use a different zookeeper.znode.parent since they can't
-     write in the same folder.
-  3. You can now start and stop the clusters with your preferred method.
-
-You can confirm that your setup works by looking at any region server's log
-on the master cluster and looking for the following lines:
-
-Considering 1 rs, with ratio 0.1
-Getting 1 rs from peer cluster # 0
-Choosing peer 10.10.1.49:62020
-
-In this case, it indicates that 1 region server from the slave cluster
-was chosen for replication.

- -Should you want to stop the replication while the clusters are running, open -the shell on the master cluster and issue this command: -
-hbase(main):001:0> zk 'set /zookeeper.znode.parent/replication/state false'
-
-Where you replace the znode parent with the one configured on your master
-cluster. Replication of already queued edits will still happen after you
-issue that command, but new entries won't be replicated. To start it back,
-simply replace "false" with "true" in the command.

Index: src/main/javadoc/org/apache/hadoop/hbase/replication/package.html
===================================================================
--- src/main/javadoc/org/apache/hadoop/hbase/replication/package.html	(revision 0)
+++ src/main/javadoc/org/apache/hadoop/hbase/replication/package.html	(working copy)
@@ -38,7 +38,7 @@

Status

-This package is alpha quality software and is only meant to be a base
+This package is experimental quality software and is only meant to be a base
 for future developments. The current implementation offers the following
 features:
@@ -66,8 +66,8 @@
 other one on the slave cluster. That also includes the
 Zookeeper clusters.

 Both clusters should have the same HBase and Hadoop major revision.
-For example, having 0.21.1 on the master and 0.21.0 on the slave is
-correct but not 0.21.1 and 0.22.0.
+For example, having 0.90.1 on the master and 0.90.0 on the slave is
+correct but not 0.90.1 and 0.89.20100725.
 Every table that contains families that are scoped for replication
 should exist on every cluster with the exact same name, same for those
 replicated families.
@@ -86,13 +86,13 @@
 the following configurations:
     <property>
    -  <name>hbase.replication.enabled</name>
    +  <name>hbase.replication</name>
       <value>true</value>
     </property>
 Run the following command on any cluster:
    -$HBASE_HOME/bin/hbase org.jruby.Main $HBASE_HOME/replication/bin/add_peer.tb
+$HBASE_HOME/bin/hbase org.jruby.Main $HBASE_HOME/bin/replication/add_peer.tb
 This will show you the help to setup the replication stream between
 both clusters. If both clusters use the same Zookeeper cluster, you have
 to use a different zookeeper.znode.parent since they can't
Index: src/site/site.xml
===================================================================
--- src/site/site.xml	(revision 963949)
+++ src/site/site.xml	(working copy)
@@ -35,6 +35,7 @@
+
Index: src/site/xdoc/replication.xml
===================================================================
--- src/site/xdoc/replication.xml	(revision 0)
+++ src/site/xdoc/replication.xml	(revision 0)
@@ -0,0 +1,420 @@
+
+      HBase Replication
+
    +

+ HBase replication is a way to copy data between HBase deployments. It
+ can serve as a disaster recovery solution and can contribute to higher
+ availability at the HBase layer. It can also serve more practical
+ purposes; for example, as a way to easily copy edits from a web-facing
+ cluster to a "MapReduce" cluster that will process old and new data and
+ ship back the results automatically.

    +

+ The basic architecture pattern used for HBase replication is (HBase cluster) master-push;
+ it is much easier to keep track of what's currently being replicated since
+ each region server has its own write-ahead-log (aka WAL or HLog), just like
+ other well known solutions such as MySQL master/slave replication, where
+ there's only one binlog to keep track of. One master cluster can
+ replicate to any number of slave clusters, and each region server
+ participates in replicating its own stream of edits.

    +

    + The replication is done asynchronously, meaning that the clusters can + be geographically distant, the links between them can be offline for + some time, and rows inserted on the master cluster won’t be + available at the same time on the slave clusters (eventual consistency). +

    +

    + The replication format used in this design is conceptually the same as + + MySQL’s statement-based replication. Instead of SQL statements, whole + WALEdits (consisting of multiple cell inserts coming from the clients' + Put and Delete) are replicated in order to maintain atomicity. +

    +

+ The HLogs from each region server are the basis of HBase replication,
+ and must be kept in HDFS as long as they are needed to replicate data
+ to any slave cluster. Each RS reads from the oldest log it needs to
+ replicate and keeps the current position inside ZooKeeper to simplify
+ failure recovery. That position can be different for every slave
+ cluster, and so can the queue of HLogs to process.
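+
+ As an illustrative sketch of that bookkeeping (the znode path and the
+ plain-text offset encoding are assumptions, not the actual layout),
+ saving and recovering a position with the ZooKeeper client could look
+ like this:
+
+import org.apache.zookeeper.KeeperException;
+import org.apache.zookeeper.ZooKeeper;
+
+// Illustrative only: persist the current read offset for an HLog under
+// that log's znode, and read it back after a failure, so replication
+// can resume from where it left off.
+public class PositionSketch {
+  static void savePosition(ZooKeeper zk, String logZnode, long offset)
+      throws KeeperException, InterruptedException {
+    zk.setData(logZnode, Long.toString(offset).getBytes(), -1); // -1: any version
+  }
+
+  static long readPosition(ZooKeeper zk, String logZnode)
+      throws KeeperException, InterruptedException {
+    byte[] data = zk.getData(logZnode, false, null);
+    return (data == null || data.length == 0)
+        ? 0L : Long.parseLong(new String(data));
+  }
+}
+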

    +

    + The clusters participating in replication can be of asymmetric sizes + and the master cluster will do its “best effort” to balance the stream + of replication on the slave clusters by relying on randomization. +

    + +
    +
    +

    + The following sections describe the life of a single edit going from a + client that communicates with a master cluster all the way to a single + slave cluster. +

    +
    +

+ The client uses an HBase API that sends a Put, Delete or ICV to a region
+ server. The key values are transformed into a WALEdit by the region
+ server, which is inspected by the replication code that, for each family
+ scoped for replication, adds the scope to the edit. The edit
+ is appended to the current WAL and is then applied to its MemStore.
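+
+ As an illustrative sketch (assuming the 0.90-era client API; the table
+ and family names are made up), this is how a family is scoped GLOBAL
+ when the table is created, and how a client edit then enters that path:
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hbase.HBaseConfiguration;
+import org.apache.hadoop.hbase.HColumnDescriptor;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.HTableDescriptor;
+import org.apache.hadoop.hbase.client.HBaseAdmin;
+import org.apache.hadoop.hbase.client.HTable;
+import org.apache.hadoop.hbase.client.Put;
+import org.apache.hadoop.hbase.util.Bytes;
+
+public class ScopedPutExample {
+  public static void main(String[] args) throws Exception {
+    Configuration conf = HBaseConfiguration.create();
+
+    // Declare a table whose family "f" is scoped for replication.
+    HTableDescriptor desc = new HTableDescriptor("t1");
+    HColumnDescriptor family = new HColumnDescriptor("f");
+    family.setScope(HConstants.REPLICATION_SCOPE_GLOBAL);
+    desc.addFamily(family);
+    new HBaseAdmin(conf).createTable(desc);
+
+    // A plain Put; the region server tags the resulting WALEdit with the
+    // family's scope before appending it to the WAL and the MemStore.
+    HTable table = new HTable(conf, "t1");
+    Put put = new Put(Bytes.toBytes("row1"));
+    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
+    table.put(put);
+    table.close();
+  }
+}
+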

    +

+ In a separate thread, the edit is read from the log (as part of a batch)
+ and only the KVs that are replicable are kept (that is, those that are
+ part of a family scoped GLOBAL in the family's schema and are not from a
+ catalog table, so not .META. or -ROOT-). When the buffer is filled, or
+ the reader hits the end of the file, the buffer is sent to a random
+ region server on the slave cluster.
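+
+ A simplified sketch of that filtering (the actual replication code is
+ organized differently; the scopes map parameter is an assumption for
+ illustration):
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.NavigableMap;
+
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.regionserver.wal.HLogKey;
+import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
+import org.apache.hadoop.hbase.util.Bytes;
+
+public class EditFilterSketch {
+  // Keep only the KVs of an entry that are replicable: scoped GLOBAL
+  // and not from a catalog table (.META. or -ROOT-).
+  static List<KeyValue> filter(HLogKey key, WALEdit edit,
+      NavigableMap<byte[], Integer> scopes) {
+    List<KeyValue> kept = new ArrayList<KeyValue>();
+    byte[] table = key.getTablename();
+    if (Bytes.equals(table, HConstants.META_TABLE_NAME) ||
+        Bytes.equals(table, HConstants.ROOT_TABLE_NAME)) {
+      return kept; // catalog edits are never replicated
+    }
+    for (KeyValue kv : edit.getKeyValues()) {
+      Integer scope = scopes.get(kv.getFamily());
+      if (scope != null &&
+          scope.intValue() == HConstants.REPLICATION_SCOPE_GLOBAL) {
+        kept.add(kv);
+      }
+    }
+    return kept;
+  }
+}
+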

    +

+ Synchronously, the region server that receives the edits reads them
+ sequentially and applies them on its own cluster using the HBase client
+ (HTables managed by an HTablePool). If consecutive rows
+ belong to the same table, they are inserted together in order to
+ leverage parallel insertions.

    +

    + Back in the master cluster's region server, the offset for the current + WAL that's being replicated is registered in ZooKeeper. +

    +
    +
    +

    + The edit is inserted in the same way. +

    +

+ In the separate thread, the region server reads, filters and buffers
+ the log edits the same way as during normal processing. The slave
+ region server that's contacted doesn't answer the RPC, so the master
+ region server will sleep and retry up to a configured number of times.
+ If the slave RS still isn't available, the master cluster RS will select a
+ new subset of RSs to replicate to and will retry sending the buffer of
+ edits.

    +

+ In the meantime, the WALs will be rolled and stored in a queue in
+ ZooKeeper. Logs that are archived by their region server (archiving is
+ basically moving a log from the region server's logs directory to a
+ central logs archive directory) will have their paths updated in the
+ in-memory queue of the replicating thread.

    +

    + When the slave cluster is finally available, the buffer will be applied + the same way as during normal processing. The master cluster RS will then + replicate the backlog of logs. +

    +
    +
    +
    +

    + This section describes in depth how each of replication's internal + features operate. +

    +
    +

+ When a master cluster RS initiates a replication source to a slave cluster,
+ it first connects to the slave's ZooKeeper ensemble using the provided
+ cluster key (that key is composed of the value of hbase.zookeeper.quorum,
+ zookeeper.znode.parent and hbase.zookeeper.property.clientPort). It
+ then scans the "rs" directory to discover all the available sinks
+ (region servers that are accepting incoming streams of edits to replicate)
+ and will randomly choose a subset of them using a configured
+ ratio (which has a default value of 10%). For example, if a slave
+ cluster has 150 machines, 15 will be chosen as potential recipients for
+ edits that this master cluster RS will be sending. Since this is done by all
+ master cluster RSs, the probability that all slave RSs are used is very high,
+ and this method works for clusters of any size. For example, a master cluster
+ of 10 machines replicating to a slave cluster of 5 machines with a ratio
+ of 10% means that each master cluster RS will choose one machine at
+ random, thus the chance of overlapping and full usage of the slave
+ cluster is higher. A sketch of this selection follows.
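+
+ A minimal sketch of that selection (the class and method names are
+ illustrative, not the actual implementation):
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+public class SinkSelectionSketch {
+  // Shuffle the slave's region servers and keep a configured ratio of
+  // them (default 10%), always keeping at least one.
+  static List<String> chooseSinks(List<String> slaveRegionServers,
+      float ratio) {
+    List<String> shuffled = new ArrayList<String>(slaveRegionServers);
+    Collections.shuffle(shuffled);
+    int count = Math.max(1, (int) Math.ceil(shuffled.size() * ratio));
+    return shuffled.subList(0, count);
+  }
+
+  public static void main(String[] args) {
+    List<String> rs = new ArrayList<String>();
+    for (int i = 0; i < 150; i++) rs.add("10.10.1." + i + ":60020");
+    // 150 slave region servers at the default 10% ratio: 15 sinks.
+    System.out.println(chooseSinks(rs, 0.1f).size()); // prints 15
+  }
+}
+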

    +
    +
    +

+ Every master cluster RS has its own znode in the replication znodes hierarchy.
+ It contains one znode per peer cluster (if 5 slave clusters, 5 znodes
+ are created), and each of these contains a queue
+ of HLogs to process. Each of these queues will track the HLogs created
+ by that RS, but they can differ in size. For example, if one slave
+ cluster becomes unavailable for some time, its HLogs cannot be replicated,
+ so they need to stay in the queue (while the others are processed).
+ See the section named "Region server failover" for an example.

    +

+ When a source is instantiated, it contains the current HLog that the
+ region server is writing to. During log rolling, the new file is added
+ to the queue of each slave cluster's znode just before it's made available.
+ This ensures that all the sources are aware that a new log exists
+ before HLog is able to append edits into it, but this makes the
+ operation slightly more expensive.
+ The queue items are discarded when the replication thread cannot read
+ more entries from a file (because it reached the end of the last block)
+ and there are other files in the queue.
+ This means that if a source is up-to-date and replicates from the log
+ that the region server writes to, reading up to the "end" of the
+ current file won't delete the item in the queue.
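+
+ A minimal sketch of that rolling behavior, assuming one in-memory queue
+ of HLog paths per peer cluster (the names are illustrative):
+
+import java.util.Map;
+import java.util.Queue;
+
+public class LogRollSketch {
+  // One queue of HLog paths per peer cluster id, mirrored in ZooKeeper.
+  private final Map<String, Queue<String>> queuesPerPeer;
+
+  LogRollSketch(Map<String, Queue<String>> queuesPerPeer) {
+    this.queuesPerPeer = queuesPerPeer;
+  }
+
+  // Called during log rolling, before the new file accepts any edit,
+  // so every source knows about the log before it contains data.
+  void preLogRoll(String newLogPath) {
+    for (Queue<String> queue : queuesPerPeer.values()) {
+      queue.add(newLogPath);
+    }
+  }
+}
+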

    +

+ When a log is archived (because it's not used anymore or because there
+ are too many of them per hbase.regionserver.maxlogs, typically because
+ the insertion rate is faster than region flushing), it will notify the
+ source threads that the path
+ for that log changed. If a particular source is already done with
+ it, it will just ignore the message. If it's in the queue, the path
+ will be updated in memory. If the log is currently being replicated,
+ the change will be done atomically so that the reader doesn't try to
+ open the file when it's already moved. Also, moving a file is a NameNode
+ operation so, if the reader is currently reading the log, it won't
+ generate any exception.

    +
    +
    +

+ By default, a source will try to read from a log file and ship log
+ entries as fast as possible to a sink. This is first limited by the
+ filtering of log entries; only KeyValues that are scoped GLOBAL and
+ that don't belong to catalog tables will be retained. A second limit
+ is imposed on the total size of the list of edits to replicate per slave,
+ which by default is 64MB. This means that a master cluster RS with 3 slaves
+ will use at most 192MB to store data to replicate. This doesn't account
+ for the filtered data that hasn't been garbage collected yet.

    +

+ Once the maximum size of edits has been buffered or the reader hits the
+ end of the log file, the source thread will stop reading and will choose
+ at random a sink to replicate to (from the list that was generated by
+ keeping only a subset of slave RSs). It will directly issue an RPC to
+ the chosen machine and will wait for the method to return. If it's
+ successful, the source will determine if the current file is emptied
+ or if it should continue to read from it. If the former, it will delete
+ the znode in the queue. If the latter, it will register the new offset
+ in the log's znode. If the RPC threw an exception, the source will retry
+ 10 times before trying to find a different sink. A sketch of this loop
+ follows.
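+
+ A sketch of that retry loop (Sink is a hypothetical stand-in for the
+ replication RPC; the real source also updates ZooKeeper as described
+ above):
+
+import java.util.List;
+import java.util.Random;
+
+public class ShipperSketch {
+  interface Sink { void replicate(List<?> edits) throws Exception; }
+
+  // Ship a buffer of edits to a randomly chosen sink, retrying a bounded
+  // number of times before switching to a different sink.
+  static void ship(List<?> buffer, List<Sink> sinks, int maxRetries)
+      throws InterruptedException {
+    Random rand = new Random();
+    Sink sink = sinks.get(rand.nextInt(sinks.size()));
+    int attempt = 0;
+    while (true) {
+      try {
+        sink.replicate(buffer); // blocking RPC to the slave RS
+        return;                 // success: the caller updates the offset
+      } catch (Exception e) {
+        if (++attempt >= maxRetries) {
+          sink = sinks.get(rand.nextInt(sinks.size())); // different sink
+          attempt = 0;
+        }
+        Thread.sleep(1000); // back off before retrying
+      }
+    }
+  }
+}
+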

    +
    +
    +

+ The sink synchronously applies the edits to its local cluster using
+ the native client API. This method is also synchronized across all
+ incoming sources, which means that each slave region server can
+ replicate only one batch of log entries at a time.

    +

+ The sink applies the edits one by one, unless it's able to batch
+ sequential Puts that belong to the same table in order to use the
+ parallel-puts feature of HConnectionManager. The Put and Delete objects
+ are recreated by inspecting the incoming WALEdit objects, and are rebuilt
+ with the exact same row, family, qualifier, timestamp, and value (for
+ Puts). Note that if the master and slave clusters don't have the same
+ time, time-related issues may occur.
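+
+ A sketch of the table-batching on the sink side (assuming the 0.90-era
+ client API; error handling is omitted, and the real sink also rebuilds
+ Deletes):
+
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.client.HTableInterface;
+import org.apache.hadoop.hbase.client.HTablePool;
+import org.apache.hadoop.hbase.client.Put;
+
+public class SinkApplySketch {
+  // Recreate Puts from the incoming KeyValues of one table and insert
+  // them as a single batch so the parallel-put path can be used.
+  static void applyPuts(HTablePool pool, String tableName,
+      List<KeyValue> kvs) throws Exception {
+    List<Put> batch = new ArrayList<Put>();
+    for (KeyValue kv : kvs) {
+      Put put = new Put(kv.getRow(), kv.getTimestamp());
+      put.add(kv); // same row, family, qualifier, timestamp and value
+      batch.add(put);
+    }
+    HTableInterface table = pool.getTable(tableName);
+    try {
+      table.put(batch); // one round of parallel puts for this table
+    } finally {
+      pool.putTable(table); // return the HTable to the pool
+    }
+  }
+}
+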

    +
    +
    +

+ If replication isn't enabled, the master's log-cleaning thread deletes
+ old logs using a configured TTL. This doesn't work well with
+ replication, since archived logs past their TTL may still be in a
+ queue. Thus, the default behavior is augmented so that if a log is
+ past its TTL, the cleaning thread will look up every queue until it
+ finds the log (while caching the ones it finds). If it's not found,
+ the log will be deleted. The next time it has to look for a log,
+ it will first use its cache.
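+
+ A minimal sketch of that augmented check (illustrative; the real
+ cleaner scans the queues' znodes in ZooKeeper):
+
+import java.util.HashSet;
+import java.util.Set;
+
+public class LogCleanerSketch {
+  // Names of logs found in any replication queue, cached between scans.
+  private final Set<String> cachedQueuedLogs = new HashSet<String>();
+
+  // A log past its TTL is deletable only if no queue still references it.
+  boolean isDeletable(String logName, Iterable<String> allQueuedLogs) {
+    if (cachedQueuedLogs.contains(logName)) {
+      return false; // cache hit: still queued somewhere
+    }
+    for (String queued : allQueuedLogs) { // full scan, cached as we go
+      cachedQueuedLogs.add(queued);
+      if (queued.equals(logName)) {
+        return false;
+      }
+    }
+    return true; // found in no queue: safe to delete
+  }
+}
+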

    +
    +
    +

+ As long as region servers don't fail, keeping track of the logs in ZK
+ doesn't add any value. Unfortunately, they do fail, and since ZooKeeper
+ is highly available, we can count on it and its semantics to help us
+ manage the transfer of the queues.

    +

    + All the master cluster RSs keep a watcher on every other one of them to be + notified when one dies (just like the master does). When it happens, + they all race to create a znode called "lock" inside the dead RS' znode + that contains its queues. The one that creates it successfully will + proceed by transferring all the queues to its own znode (one by one + since ZK doesn't support the rename operation) and will delete all the + old ones when it's done. The recovered queues' znodes will be named + with the id of the slave cluster appended with the name of the dead + server. +
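+
+ A sketch of that race using the plain ZooKeeper client (paths are
+ illustrative): create() is atomic, so exactly one region server wins.
+
+import org.apache.zookeeper.CreateMode;
+import org.apache.zookeeper.KeeperException;
+import org.apache.zookeeper.ZooDefs.Ids;
+import org.apache.zookeeper.ZooKeeper;
+
+public class QueueLockSketch {
+  // Every surviving RS calls this; only the winner transfers the queues.
+  static boolean tryLockQueues(ZooKeeper zk, String deadRsZnode)
+      throws KeeperException, InterruptedException {
+    try {
+      zk.create(deadRsZnode + "/lock", new byte[0],
+          Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
+      return true;  // we won: copy the queues, then delete the old znodes
+    } catch (KeeperException.NodeExistsException e) {
+      return false; // another region server already owns the transfer
+    }
+  }
+}
+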

    +

    + Once that is done, the master cluster RS will create one new source thread per + copied queue, and each of them will follow the read/filter/ship pattern. + The main difference is that those queues will never have new data since + they don't belong to their new region server, which means that when + the reader hits the end of the last log, the queue's znode will be + deleted and the master cluster RS will close that replication source. +

    +

    + For example, consider a master cluster with 3 region servers that's + replicating to a single slave with id '2'. The following hierarchy + represents what the znodes layout could be at some point in time. We + can see the RSs' znodes all contain a "peers" znode that contains a + single queue. The znode names in the queues represent the actual file + names on HDFS in the form "address,port.timestamp". +

    +
    +/hbase/replication/rs/
    +                      1.1.1.1,60020,123456780/
    +                        peers/
    +                              2/
    +                                1.1.1.1,60020.1234  (Contains a position)
    +                                1.1.1.1,60020.1265
    +                      1.1.1.2,60020,123456790/
    +                        peers/
    +                              2/
    +                                1.1.1.2,60020.1214  (Contains a position)
    +                                1.1.1.2,60020.1248
    +                                1.1.1.2,60020.1312
+                      1.1.1.3,60020,123456630/
    +                        peers/
    +                              2/
    +                                1.1.1.3,60020.1280  (Contains a position)
    +
    +        
    +

+ Now let's say that 1.1.1.2 loses its ZK session. The survivors will race
+ to create a lock, and for some reason 1.1.1.3 wins. It will then start
+ transferring all the queues to its local peers znode by appending the
+ name of the dead server. Right before 1.1.1.3 is able to clean up the
+ old znodes, the layout will look like the following:

    +
    +/hbase/replication/rs/
    +                      1.1.1.1,60020,123456780/
    +                        peers/
    +                              2/
    +                                1.1.1.1,60020.1234  (Contains a position)
    +                                1.1.1.1,60020.1265
    +                      1.1.1.2,60020,123456790/
    +                        lock
    +                        peers/
    +                              2/
    +                                1.1.1.2,60020.1214  (Contains a position)
    +                                1.1.1.2,60020.1248
    +                                1.1.1.2,60020.1312
    +                      1.1.1.3,60020,123456630/
    +                        peers/
    +                              2/
    +                                1.1.1.3,60020.1280  (Contains a position)
    +
    +                              2-1.1.1.2,60020,123456790/
    +                                1.1.1.2,60020.1214  (Contains a position)
    +                                1.1.1.2,60020.1248
    +                                1.1.1.2,60020.1312
    +        
    +

+ Some time later, but before 1.1.1.3 is able to finish replicating the
+ last HLog from 1.1.1.2, let's say that it dies too (and, in the meantime,
+ some new logs were created in the normal queues). The last RS will then
+ try to lock 1.1.1.3's znode and will begin transferring all the queues.
+ The new layout will be:

    +
    +/hbase/replication/rs/
    +                      1.1.1.1,60020,123456780/
    +                        peers/
    +                              2/
    +                                1.1.1.1,60020.1378  (Contains a position)
    +
    +                              2-1.1.1.3,60020,123456630/
    +                                1.1.1.3,60020.1325  (Contains a position)
    +                                1.1.1.3,60020.1401
    +
    +                              2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
    +                                1.1.1.2,60020.1312  (Contains a position)
    +                      1.1.1.3,60020,123456630/
    +                        lock
    +                        peers/
    +                              2/
    +                                1.1.1.3,60020.1325  (Contains a position)
    +                                1.1.1.3,60020.1401
    +
    +                              2-1.1.1.2,60020,123456790/
    +                                1.1.1.2,60020.1312  (Contains a position)
    +        
    +
    +
    +
    +

    + Here's a list of all the jiras that relate to major issues or missing + features in the replication implementation. +

    +
    +

+ HBASE-2688: master-master replication is disabled in the code; we need
+ to enable and test it.

    +
    +
    +

+ HBASE-2611: basically, if a region server dies while recovering the
+ queues of another dead RS, we will miss the data from the queues
+ that weren't copied.

    +
    +
    +

+ HBASE-2196: a master cluster can currently only support a single slave;
+ some refactoring is needed to support more.

    +
    +
    +

+ HBASE-2195: edits are applied regardless of their home cluster; they
+ should carry that data so it can be checked.

    +
    +
    +

+ HBASE-2200: currently all the replication operations (adding or removing
+ streams, for example) are done only when the clusters are offline. This
+ should be possible at runtime.

    +
    +
    +
    +
    +

    + Here we discuss why we went with master-push first. +

    +
    +
    +

    + +

    +
    +
    + +
    \ No newline at end of file