Index: src/site/site.xml
===================================================================
--- src/site/site.xml	(revision 962728)
+++ src/site/site.xml	(working copy)
@@ -35,6 +35,7 @@
+
Index: src/site/xdoc/replication.xml
===================================================================
--- src/site/xdoc/replication.xml	(revision 0)
+++ src/site/xdoc/replication.xml	(revision 0)
@@ -0,0 +1,401 @@
+
+
+
+
+
+
+
+
+      HBase Replication
+
+
+
+
+

+ HBase replication is a way to copy data between HBase deployments. It + can serve as a disaster recovery solution and can help provide higher + availability. It can also serve a more practical purpose; for example, as a + way to easily copy edits from a web-facing cluster to a "MapReduce" + cluster which will process old and new data and ship back the results + automatically. +

+

+ The basic architecture pattern used for HBase replication is master-push; + it is much easier to keep track of what’s currently being replicated since + each region server has its own write-ahead log (aka WAL or HLog), compared + to other well-known solutions like MySQL master/slave replication where + there’s only one binlog to keep track of. One master cluster can + replicate to any number of slave clusters, and each region server + participates in replicating its own stream of edits. +

+

+ The replication is done asynchronously, meaning that the clusters can + be geographically distant, the links between them can be offline for + some time, and rows inserted on the master cluster won’t be + immediately available on the slave clusters (eventual consistency). +

+

+ The replication format used in this design is conceptually the same as + MySQL’s statement-based replication. Instead of SQL statements, whole + WALEdits (consisting of multiple cell inserts coming from the clients' + Put and Delete calls) are replicated in order to maintain atomicity. +

+

+ The HLogs from each region server are the basis of HBase replication, + and must be kept in HDFS as long as they are needed to replicate data + to any slave cluster. Each RS reads from the oldest log it needs to + replicate and keeps the current position inside ZooKeeper to simplify + failure recovery. That position, as well as the queue of HLogs to + process, can be different for every slave cluster. +

+

+ The clusters participating in replication can be of asymmetric sizes + and the master cluster will make a “best effort” to balance the stream + of replication across the slave clusters by relying on randomization. +

+ +
+
+

+ The following sections describe the life of a single edit going from a + client that communicates with a master cluster all the way to a single + slave cluster. +

+
+

+ The client uses an HBase API that sends a Put, Delete or ICV to a region + server. The key values are transformed into a WALEdit, which is inspected + by the replication code; for each family that is scoped for replication, + the scope is added to the edit. The edit is appended to the current WAL + and is then applied to the MemStore. +
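+ For illustration, here is a minimal sketch of what this looks like from the client side, assuming the Java client API of this HBase version (the table name, family name and values below are made up):
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hbase.HBaseConfiguration;
+import org.apache.hadoop.hbase.HColumnDescriptor;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.HTableDescriptor;
+import org.apache.hadoop.hbase.client.HBaseAdmin;
+import org.apache.hadoop.hbase.client.HTable;
+import org.apache.hadoop.hbase.client.Put;
+import org.apache.hadoop.hbase.util.Bytes;
+
+public class ReplicatedPutExample {
+  public static void main(String[] args) throws Exception {
+    Configuration conf = HBaseConfiguration.create();
+
+    // Create a table whose family "d" is scoped for replication.
+    HBaseAdmin admin = new HBaseAdmin(conf);
+    HTableDescriptor tableDesc = new HTableDescriptor("example_table");
+    HColumnDescriptor family = new HColumnDescriptor("d");
+    family.setScope(HConstants.REPLICATION_SCOPE_GLOBAL);  // 1 = replicated, 0 = local only
+    tableDesc.addFamily(family);
+    admin.createTable(tableDesc);
+
+    // This Put is appended to the WAL with the family's scope attached,
+    // applied to the MemStore, and later picked up by the replication code.
+    HTable table = new HTable(conf, "example_table");
+    Put put = new Put(Bytes.toBytes("row1"));
+    put.add(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes("value"));
+    table.put(put);
+  }
+}
+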

+

+ In a separate thread, the edit is read from the log (as part of a batch) + and only the KVs that are replicable are kept (that is, those that belong + to a family scoped GLOBAL and are not part of a catalog table). When the buffer is filled, + or the reader hits the end of the file, the buffer is sent to a random + region server on the slave cluster. +
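+ The filtering rule can be pictured with the hypothetical helper below; it is not the actual replication source code, only an illustration of which KeyValues survive the filter:
+
+import org.apache.hadoop.hbase.HColumnDescriptor;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.HTableDescriptor;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.util.Bytes;
+
+// Hypothetical helper: keep only KeyValues whose family is scoped GLOBAL
+// and whose table is not a catalog table (-ROOT- or .META.).
+public class ReplicationFilterSketch {
+  public static boolean isReplicable(byte[] tableName, KeyValue kv,
+      HTableDescriptor tableDesc) {
+    if (Bytes.equals(tableName, HConstants.ROOT_TABLE_NAME)
+        || Bytes.equals(tableName, HConstants.META_TABLE_NAME)) {
+      return false;                      // never replicate catalog edits
+    }
+    HColumnDescriptor family = tableDesc.getFamily(kv.getFamily());
+    return family != null
+        && family.getScope() == HConstants.REPLICATION_SCOPE_GLOBAL;
+  }
+}
+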

+

+ Synchronously, the region server that receives the edits reads them + sequentially and applies them on its local cluster using a pool of + HTables. If consecutive rows belong to the same table, they are + inserted together in order to leverage parallel insertions. +

+

+ Back in the master cluster's region server, the offset for the current + WAL that's being replicated is registered in ZooKeeper. +
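+ As a purely illustrative sketch (the znode path is supplied by the caller and the class name is made up), recording that offset amounts to a plain ZooKeeper write:
+
+import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.zookeeper.KeeperException;
+import org.apache.zookeeper.ZooKeeper;
+
+// Illustrative only: write the current WAL read offset into the HLog's znode
+// so that recovery can resume from the last acknowledged position.
+public class ReplicationPositionSketch {
+  public static void savePosition(ZooKeeper zk, String hlogZnode, long position)
+      throws KeeperException, InterruptedException {
+    zk.setData(hlogZnode, Bytes.toBytes(position), -1);  // -1 = ignore znode version
+  }
+}
+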

+
+
+

+ The edit is inserted in the same way. +

+

+ In the separate thread, the region server reads, filters and buffers + the log edits the same way as during normal processing. The slave + region server that's contacted doesn't answer the RPC, so the master + region server will sleep and retry up to a configured number of times. + If the slave RS still isn't available, the master RS will select a + new subset of RSs to replicate to and will retry sending the buffer of + edits. +

+

+ In the meantime, the WALs will be rolled and stored in a queue in + ZooKeeper. When a log is archived, its path is updated in the + in-memory queue of the replicating thread. +

+

+ When the slave cluster is finally available, the buffer will be applied + the same way as during normal processing. The master RS will then + replicate the backlog of logs. +

+
+
+
+

+ This section describes in depth how replication's internal + features operate. +

+
+

+ When a master RS initiates a replication source to a slave cluster, + it first connects to the slave's ZooKeeper ensemble using the provided + cluster key. It then scans the "rs" directory to discover all the + available sinks and will randomly choose a subset of them using a configured + ratio (which has a default value of 10%). For example, if a slave + cluster has 150 machines, 15 will be chosen as potential sinks for this + master RS. Since this is done by all master RSs, the probability that + all slave RSs are used is very high, and this method works for clusters + of any size. +
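+ The selection can be pictured as simple random sampling, as in the hypothetical sketch below (class and method names are made up):
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+// Hypothetical sketch: choose ceil(ratio * n) sinks at random from the list of
+// region servers found under the slave cluster's "rs" znode.
+public class SinkSelectionSketch {
+  public static List<String> chooseSinks(List<String> slaveRegionServers, float ratio) {
+    List<String> candidates = new ArrayList<String>(slaveRegionServers);
+    Collections.shuffle(candidates);
+    int count = (int) Math.ceil(candidates.size() * ratio);  // e.g. 150 * 0.1 = 15
+    return candidates.subList(0, count);
+  }
+}
+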

+
+
+

+ Every master RS has its own znode in the replication znodes hierarchy. + It contains one znode per peer cluster, and each of those contains a queue + of HLogs to process. Each of those queues will track the HLogs created + by that RS, but they can differ in size. For example, if one slave + cluster becomes unavailable for some time then its HLogs cannot be + replicated, thus they need to stay in the queue (while the others are processed). +

+

+ When a source is instantiated, it contains the current HLog that the + region server is writing to. During log rolling, the new file is added + to the queue of each slave cluster's znode just before it's made available. + The queue items are deleted when the replication thread cannot read + more entries from a file and there are other files in the queue. + This means that if a source is up-to-date and replicates from the log + that the region server writes to, reading up to the "end" of the + current file won't delete the item in the queue. +

+

+ When a log is archived (because it's not used anymore or because there + are too many of them), it will notify the source threads that the path + for that log changed. If a particular source was already done with + it, it will just ignore the message. If it's in the queue, the path + will be updated in memory. If the log is currently being replicated, + the change will be done atomically so that the reader doesn't try to + open the file when it's already moved. Also, moving a file is a NameNode + operation so, if the reader is currently reading the log, it won't + generate any exception. +

+
+
+

+ By default, a source will try to read from a log file and ship log + entries as fast as possible to a sink. This is first limited by the + filtering of log entries; only KeyValues that are scoped GLOBAL and + that don't belong to catalog tables will be retained. A second limit + is imposed on the total size of the list of edits to replicate per slave, + which by default is 64MB. This means that a master RS with 3 slaves + will use at most 192MB to store data to replicate. This doesn't account + for the data that was filtered out but not yet garbage collected. +

+

+ Once the maximum size of edits has been buffered or the reader hits the end + of the log file, the source thread will stop reading and will choose + at random a sink to replicate to (from the list that was generated by + keeping only a subset of slave RSs). It will directly issue an RPC to + the chosen machine and will wait for the method to return. If the RPC is + successful, the source will determine if the current file has been completely + read or if it should continue to read from it. If the former, it will delete + the znode in the queue. If the latter, it will register the new offset + in the log's znode. If the RPC threw an exception, the source will retry + up to 10 times before trying to find a different sink. +
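+ Put together, the shipping loop behaves roughly like the hypothetical sketch below; the interface, class and parameter names are all made up and only illustrate the retry behaviour described above:
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Random;
+
+// Hypothetical sketch of the ship-and-retry behaviour; not the actual implementation.
+public class ShipEditsSketch {
+  /** Stand-in for the synchronous RPC a master RS issues to a slave RS. */
+  public interface ReplicationSinkRpc {
+    void replicate(byte[][] serializedEdits) throws IOException;
+  }
+
+  public static void shipEdits(byte[][] buffer, List<ReplicationSinkRpc> sinks,
+      int maxRetries, long sleepMs) throws InterruptedException {
+    Random rand = new Random();
+    int attempts = 0;
+    while (true) {
+      // pick one sink at random from the previously chosen subset of slave RSs
+      ReplicationSinkRpc sink = sinks.get(rand.nextInt(sinks.size()));
+      try {
+        sink.replicate(buffer);   // blocks until the slave RS has applied the batch
+        return;                   // success: the caller records the new WAL offset in ZK
+      } catch (IOException e) {
+        if (++attempts >= maxRetries) {
+          // the real source would now select a new subset of slave RSs and retry
+          return;
+        }
+        Thread.sleep(sleepMs);    // back off before retrying
+      }
+    }
+  }
+}
+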

+
+
+

+ The sink synchronously applies the edits to its local cluster using + the native client API. This method is also synchronized across all + incoming sources, which means that only one batch of log entries can be + replicated at a time by each slave region server. +

+

+ The sink applies the edits one by one, unless it's able to batch + sequential Puts that belong to the same table in order to use the + parallel puts feature of HConnectionManager. The Put and Delete objects + are recreated by inspecting the incoming WALEdit objects and are created + with the exact same row, family, qualifier, timestamp, and value (for + Puts). Note that if the master and slave clusters don't have their clocks + synchronized, time-related issues may occur. +
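+ A hypothetical sketch of that batching rule, assuming an HTablePool is used to reuse table handles (the class and method names are made up):
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.hadoop.hbase.client.Delete;
+import org.apache.hadoop.hbase.client.HTableInterface;
+import org.apache.hadoop.hbase.client.HTablePool;
+import org.apache.hadoop.hbase.client.Put;
+
+// Hypothetical sketch: flush a run of consecutive Puts for one table as a
+// single multi-put so the client's parallel insertion path is used.
+public class SinkBatchSketch {
+  public static void applyBatch(HTablePool pool, byte[] tableName,
+      List<Put> puts, List<Delete> deletes) throws IOException {
+    HTableInterface table = pool.getTable(tableName);
+    try {
+      if (!puts.isEmpty()) {
+        table.put(puts);           // one multi-put for the whole run of rows
+      }
+      for (Delete delete : deletes) {
+        table.delete(delete);      // deletes are applied one by one
+      }
+    } finally {
+      pool.putTable(table);        // return the handle to the pool
+    }
+  }
+}
+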

+
+
+

+ If replication isn't enabled, the master's log-cleaning thread will + delete old logs using a configured TTL. This doesn't work well with + replication since archived logs past their TTL may still be in a + queue. Thus, the default behavior is augmented so that if a log is + past its TTL, the cleaning thread will look up every queue until it + finds the log (while caching the ones it finds). If it's not found, + the log will be deleted. The next time it has to look for a log, + it will first use its cache. +
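+ Conceptually, the augmented check looks like the hypothetical sketch below (the cache and the queue-scanning helper are made-up names):
+
+import java.util.Set;
+
+import org.apache.hadoop.fs.Path;
+
+// Hypothetical sketch of the augmented cleaning rule: a log past its TTL may
+// only be deleted if no replication queue in ZooKeeper still references it.
+public class LogCleaningSketch {
+  /** Stand-in for the code that walks the replication znodes. */
+  public interface QueueScanner {
+    boolean isReferencedInAnyQueue(String logName, Set<String> cache);
+  }
+
+  public static boolean isLogDeletable(Path oldLog, Set<String> cachedQueuedLogs,
+      QueueScanner scanner) {
+    String name = oldLog.getName();
+    if (cachedQueuedLogs.contains(name)) {
+      return false;                 // known to still be queued for some slave cluster
+    }
+    // scan every RS' queues in ZooKeeper, adding the logs found to the cache
+    if (scanner.isReferencedInAnyQueue(name, cachedQueuedLogs)) {
+      return false;
+    }
+    return true;                    // past its TTL and not queued anywhere: safe to delete
+  }
+}
+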

+
+
+

+ As long as region servers don't fail, keeping track of the logs in ZK + doesn't add any value. Unfortunately, they do fail, and since ZooKeeper + is highly available we can count on it and its semantics to help us + manage the transfer of the queues. +

+

+ All the master RSs keep a watcher on every other one of them to be + notified when one dies (just like the master does). When it happens, + they all race to create a znode called "lock" inside the dead RS' znode + that contains its queues. The one that creates it successfully will + proceed by transferring all the queues to its own znode (one by one + since ZK doesn't support the rename operation) and will delete all the + old ones when it's done. The recovered queues' znodes will be named + with the id of the slave cluster appended with the name of the dead + server. +
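+ The race on the "lock" znode can be pictured with the plain ZooKeeper client API, as in this hypothetical sketch:
+
+import org.apache.zookeeper.CreateMode;
+import org.apache.zookeeper.KeeperException;
+import org.apache.zookeeper.ZooDefs;
+import org.apache.zookeeper.ZooKeeper;
+
+// Hypothetical sketch of the failover race: every surviving RS tries to create
+// the "lock" znode under the dead RS' replication znode; only the winner
+// transfers the queues. The znode path is passed in by the caller.
+public class QueueLockSketch {
+  public static boolean tryToLockDeadRsQueues(ZooKeeper zk, String deadRsZnode)
+      throws InterruptedException {
+    try {
+      zk.create(deadRsZnode + "/lock", new byte[0],
+          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
+      return true;                  // we won the race: copy the queues, then clean up
+    } catch (KeeperException.NodeExistsException e) {
+      return false;                 // another region server got the lock first
+    } catch (KeeperException e) {
+      return false;                 // ZK error: let another RS handle the transfer
+    }
+  }
+}
+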

+

+ Once that is done, the master RS will create one new source thread per + copied queue, and each of them will follow the read/filter/ship pattern. + The main difference is that those queues will never have new data since + they don't belong to their new region server, which means that when + the reader hits the end of the last log, the queue's znode will be + deleted and the master RS will close that replication source. +

+

+ For example, consider a master cluster with 3 region servers that is + replicating to a single slave cluster with id '2'. The following hierarchy + represents what the znode layout could be at some point in time. We + can see that the RSs' znodes all contain a "peers" znode that contains a + single queue. The znode names in the queues represent the actual file + names on HDFS in the form "address,port.timestamp". +

+
+/hbase/replication/rs/
+                      1.1.1.1,60020,123456780/
+                        peers/
+                              2/
+                                1.1.1.1,60020.1234  (Contains a position)
+                                1.1.1.1,60020.1265
+                      1.1.1.2,60020,123456790/
+                        peers/
+                              2/
+                                1.1.1.2,60020.1214  (Contains a position)
+                                1.1.1.2,60020.1248
+                                1.1.1.2,60020.1312
+                      1.1.1.3,60020,123456630/
+                        peers/
+                              2/
+                                1.1.1.3,60020.1280  (Contains a position)
+
+        
+

+ Now let's say that 1.1.1.2 loses its ZK session. The survivors will race + to create a lock, and for some reason 1.1.1.3 wins. It will then start + transferring all the queues to its local peers znode by appending the + name of the dead server. Right before 1.1.1.3 is able to clean up the + old znodes, the layout will look like the following: +

+
+/hbase/replication/rs/
+                      1.1.1.1,60020,123456780/
+                        peers/
+                              2/
+                                1.1.1.1,60020.1234  (Contains a position)
+                                1.1.1.1,60020.1265
+                      1.1.1.2,60020,123456790/
+                        lock
+                        peers/
+                              2/
+                                1.1.1.2,60020.1214  (Contains a position)
+                                1.1.1.2,60020.1248
+                                1.1.1.2,60020.1312
+                      1.1.1.3,60020,123456630/
+                        peers/
+                              2/
+                                1.1.1.3,60020.1280  (Contains a position)
+
+                              2-1.1.1.2,60020,123456790/
+                                1.1.1.2,60020.1214  (Contains a position)
+                                1.1.1.2,60020.1248
+                                1.1.1.2,60020.1312
+        
+

+ Some time later, but before 1.1.1.3 is able to finish replicating the + last HLog from 1.1.1.2, let's say that it dies too (in the meantime, some + new logs were also created in the normal queues). The last RS will then try + to lock 1.1.1.3's znode and will begin transferring all the queues. The new + layout will be: +

+
+/hbase/replication/rs/
+                      1.1.1.1,60020,123456780/
+                        peers/
+                              2/
+                                1.1.1.1,60020.1378  (Contains a position)
+
+                              2-1.1.1.3,60020,123456630/
+                                1.1.1.3,60020.1325  (Contains a position)
+                                1.1.1.3,60020.1401
+
+                              2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
+                                1.1.1.2,60020.1312  (Contains a position)
+                      1.1.1.3,60020,123456630/
+                        lock
+                        peers/
+                              2/
+                                1.1.1.3,60020.1325  (Contains a position)
+                                1.1.1.3,60020.1401
+
+                              2-1.1.1.2,60020,123456790/
+                                1.1.1.2,60020.1312  (Contains a position)
+        
+
+
+
+

+ Here's a list of all the JIRAs that cover major issues or missing + features in the replication implementation. +

+
+

+ HBASE-2688: master-master replication is disabled in the code; we need + to enable and test it. +

+
+
+

+ HBASE-2611: if a region server dies while recovering the + queues of another dead RS, we will miss the data from the queues + that weren't copied. +

+
+
+

+ HBASE-2196: a master cluster can currently only support a single slave + cluster; some refactoring is needed to support multiple slaves. +

+
+
+

+ HBASE-2195: edits are currently applied without regard to their cluster + of origin; they should carry that information so it can be checked. +

+
+
+

+ HBASE-2200: currently all the replication operations (adding or removing + streams, for example) can only be done when the clusters are offline. This + should be possible at runtime. +

+
+
+
+
+

+ Here we discuss why we went with master-push first. +

+
+
+

+ +

+
+
+ +
\ No newline at end of file