We have an internal service built on Redis that is considering writing through to HBase directly for their persistence needs. Their current experience with Redis is:
- Average write latency is ~milliseconds
- p999 write latencies are "a few seconds"
They want a similar experience when writing simple values directly to HBase. Infrequent exceptions to this would be acceptable.
- Availability of 99.9% for writes
- Expect most writes to be serviced within a few milliseconds, e.g. few millis at p95. Still evaluating what the requirement should be (~millis at p90 vs p95 vs p99).
- Timeout of 2 seconds, should be rare
A fallback plan is being considered for the case where HBase cannot respond within 2 seconds. However, this fallback cannot guarantee durability, because Redis or the service's daemons may go down. They want HBase to provide the required durability.
Because this is a caching service, where all writes are expected to be served again from cache for at least a while, HBase could accept writes that are not immediately visible; a worst-case visibility delay of 10-20 minutes would be fine. This is relatively easy to achieve as an engineering target should we consider offering a write option that does not guarantee immediate visibility. (A proposal follows below.) We are considering store-and-forward of simple mutations and perhaps also simple deletes, although the latter is not a hard requirement. Out of order processing of this subset of mutation requests is acceptable because their data model ensures all values are immutable. Presumably on the HBase side the timestamps of the requests would be set to the current server wall clock time when received, so eventually when applied all are available with correct temporal ordering (within the effective resolution of the server clocks). Deletes which are not immediately applied (or which fail) could cause application level confusion. Although this would remain a concern for the general case, for this specific use case stale reads could be explained to and tolerated by their users.
The BigTable architecture assigns at most one server to serve a region at a time. Region Replicas are an enhancement to the base BigTable architecture we made in HBase which stands up two more read-only replicas for a given region, meaning a client attempting a read has the option to fail over very quickly from the primary to a replica for a (potentially stale) read, or distribute read load over all replicas, or employ a hedged reading strategy. Enabling region replicas and timeline consistency can lower the availability gap for reads in the high percentiles from ~minutes to ~milliseconds. However, this option will not help for write use cases wanting roughly the same thing, because there can be no fail-over for writes. Writes must still go to the active primary. When that region is in transition, writes must be held on the client until it is redeployed. Likewise, if region replicas are not enabled, when the sole region is in transition, writes must be held on the client until the region is available again.
Regions enter the in-transition state for two reasons: failures, and housekeeping (splits and merges, or balancing). Time to region redeployment after failures depends on a number of factors, like how long it took for us to become aware of the failure, and how long it takes to split the write-ahead log of the failed server and distribute the recovered edits to the reopening region(s). We could in theory improve this behavior by being more predictive about declaring failure, like employing a phi accrual failure detector to signal to the master from clients that a regionserver is sick. Other time-to-recovery issues and mitigations are discussed in a number of JIRAs and blog posts and not discussed further here. Regarding housekeeping activities, splits and merges typically complete in under a second. However, split times up to ~30 seconds have been observed at my place of employ in rare conditions. In the instances I have investigated the cause is I/O stalls on the datanodes and metadata request stalls in the namenode, so not unexpected outlier cases. Mitigating these risks involves looking at split and merge policies. Split and merge policies are pluggable, and policy choices can be applied per table. In extreme cases, auto-splitting (and auto-merging) can be disabled on performance sensitive tables and accomplished through manual means during scheduled maintenance windows. Regions may also be moved by the Balancer to avoid unbalanced loading over the available cluster resources. During balancing, one or more regions are closed on some servers, then opened on others. While closing, a region must flush all of its memstores, yet will not accept any new requests during flushing, because it is closing. This can lead to short availability gaps.
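To make the phi accrual idea mentioned above concrete, here is a minimal sketch of such a detector. It is not HBase code: the class name `PhiAccrualDetector` and its methods are hypothetical, and it uses the simplified exponential inter-arrival model (phi = elapsed / (mean * ln 10)) rather than the normal-distribution model of the original phi accrual formulation. A client could feed it the arrival times of successful responses from a regionserver and report the server as suspect once phi crosses a threshold of its choosing.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a phi accrual failure detector (exponential model). */
final class PhiAccrualDetector {
    private final int windowSize;
    private final Deque<Long> intervalsMs = new ArrayDeque<>();
    private long intervalSumMs = 0;
    private long lastHeartbeatMs = -1;

    PhiAccrualDetector(int windowSize) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be >= 1");
        }
        this.windowSize = windowSize;
    }

    /** Record a heartbeat (or any successful response) observed at nowMs. */
    synchronized void heartbeat(long nowMs) {
        if (lastHeartbeatMs >= 0) {
            long interval = nowMs - lastHeartbeatMs;
            intervalsMs.addLast(interval);
            intervalSumMs += interval;
            // Keep only the most recent windowSize intervals.
            if (intervalsMs.size() > windowSize) {
                intervalSumMs -= intervalsMs.removeFirst();
            }
        }
        lastHeartbeatMs = nowMs;
    }

    /** Suspicion level at nowMs; 0.0 until at least one interval has been seen. */
    synchronized double phi(long nowMs) {
        if (intervalsMs.isEmpty()) {
            return 0.0;
        }
        double meanMs = (double) intervalSumMs / intervalsMs.size();
        if (meanMs <= 0) {
            return 0.0; // degenerate history; no basis for suspicion yet
        }
        double elapsedMs = Math.max(0, nowMs - lastHeartbeatMs);
        // P(next heartbeat arrives later than elapsed) = exp(-elapsed / mean),
        // so phi = -log10(P) = elapsed / (mean * ln 10).
        return elapsedMs / (meanMs * Math.log(10.0));
    }
}
```

With heartbeats arriving once per second, phi stays below 1 for about 2.3 seconds of silence and grows linearly after that, so the threshold directly trades detection speed against false positives.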
The Balancer's strategy can be tuned, or on clusters where any disruption is undesirable, the balancer can be disabled, either by admin API or by plugging in a custom implementation that does nothing, and enabled or invoked manually only during scheduled maintenance. While these options are available, they are needlessly complex to consider for use cases that can be satisfied with simple dogged store-and-forward of mutations accepted on behalf of an unavailable region. It would be far simpler from the user perspective to offer a new flag for mutation requests. It may also not be tenable to apply the global configuration changes discussed above to a multitenant cluster.
The requirement to always take writes even under partial failure conditions is a prime motivator for the development of eventually consistent systems. However while those systems can accept writes under a wider range of failure conditions than others, like HBase, which strive for consistency, they cannot guarantee those writes are immediately available for reads. Far from it. The guarantees about data availability and freshness are reduced or eliminated in eventually consistent designs. Consistent semantics remain highly desirable even though we have to make availability tradeoffs. Eventually consistent designs expose data inconsistency issues to their applications, and this is a constant pain point for even the best developers. We want to retain HBase's consistent semantics and operational model for the vast majority of use cases. That said, we can look at some changes that improve the apparent availability of an HBase cluster for a subset of simple mutation requests, for use cases that want to relax some guarantees for writes in a similar manner as we have done earlier for reads via the read replica feature.
If we accept the requirement to always accept writes, if any server is available, and there is no need to make them immediately visible, we can introduce a new write request attribute that says "it is fine to accept this on behalf of the now or future region holder, in a store-and-forward manner", for a subset of possible write operations. Append and Increment require the server to return the up-to-date result, so they are not eligible. CheckAndXXX operations likewise must be executed at the primary. Deletes could be dangerous to apply out of order and so should not be accepted as a rule. Perhaps simple deletes could be supported, if an additional safety valve is switched off in configuration, but not DeleteColumn or DeleteFamily. Simple Puts and multi-puts (batch Put or RowMutations) can be safely serviced. This can still satisfy requirements for dogged persistence of simple writes and benefit a range of use cases. Should the primary region be unavailable for accepting this subset of writes, the client would contact another regionserver, any regionserver, with the new operation flag set, and that regionserver would then accept the write on behalf of the future holder of the in-transition region. (Technically, a client could set this flag at any time for any reason.) Regionservers to which writes are handed off must then efficiently and doggedly drain their store-and-forward queue. This queue must be durable across process failures and restarts. We can use existing regionserver WAL facilities to support this. This would be similar in some ways to how cross cluster replication is implemented: edits are persisted to the WAL at the source, and the WAL entries are later enumerated and applied at the sink. The differences here are:
- Regionservers would accept edits for regions they are not currently servicing.
- WAL replay must also handle these "martian" edits, adding them to the store-and-forward queue of the regionserver recovering the WAL.
- Regionservers will queue such edits and apply them to the local cluster instead of shipping them out; in other words, the local regionserver acts as a replication sink, not a source.
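The accept-and-drain behavior described above can be sketched as follows. This is an illustration under stated assumptions, not a proposed implementation: `StoreAndForwardSketch`, `OpType`, and `Edit` are hypothetical names, the queue is held in memory (the proposal would persist entries via the regionserver WAL), and `allowSimpleDeletes` stands in for the configuration safety valve. The sketch stamps each edit with the accepting server's clock on receipt and retains any edit whose region is still in transition.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.EnumSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical in-memory stand-in for a WAL-backed store-and-forward queue. */
final class StoreAndForwardSketch {
    enum OpType {
        PUT, MULTI_PUT, DELETE, DELETE_COLUMN, DELETE_FAMILY,
        APPEND, INCREMENT, CHECK_AND_MUTATE
    }

    static final class Edit {
        final String region;
        final OpType type;
        final String row;
        final String value;
        final long timestampMs; // assigned by the accepting server on receipt

        Edit(String region, OpType type, String row, String value, long timestampMs) {
            this.region = region;
            this.type = type;
            this.row = row;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    private final Deque<Edit> queue = new ArrayDeque<>();
    private final Set<OpType> allowed;

    StoreAndForwardSketch(boolean allowSimpleDeletes) {
        // Only simple puts by default; simple deletes are opt-in.
        allowed = EnumSet.of(OpType.PUT, OpType.MULTI_PUT);
        if (allowSimpleDeletes) {
            allowed.add(OpType.DELETE);
        }
    }

    /** Returns false for operations that must execute at the primary. */
    synchronized boolean accept(String region, OpType type, String row,
                                String value, long serverNowMs) {
        if (!allowed.contains(type)) {
            return false;
        }
        queue.addLast(new Edit(region, type, row, value, serverNowMs));
        return true;
    }

    /**
     * Applies every queued edit whose region is online and returns the count.
     * Edits for regions still in transition stay queued. If applyToPrimary
     * throws, the failing edit is retained for a later attempt.
     */
    synchronized int drain(Predicate<String> regionIsOnline, Consumer<Edit> applyToPrimary) {
        int applied = 0;
        for (Iterator<Edit> it = queue.iterator(); it.hasNext();) {
            Edit e = it.next();
            if (regionIsOnline.test(e.region)) {
                applyToPrimary.accept(e);
                it.remove();
                applied++;
            }
        }
        return applied;
    }

    synchronized int depth() {
        return queue.size();
    }
}
```

Because edits carry the timestamp assigned at acceptance, a drain that skips one region and applies another out of arrival order still leaves cells with the intended temporal ordering once everything has been applied.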
There could be no guarantee on the eventual visibility of requests accepted in this manner, and writes accepted into store-and-forward queues may be applied out of order, although the timestamp component of HBase keys will ensure correct temporal ordering (within the effective resolution of the server clocks) after all are eventually applied. This is consistent with the semantics one gets with eventually consistent systems. This would not be default behavior, nor default semantics. HBase would continue to trade off for consistency, unless this new feature/flag is enabled by an informed party. This is consistent with the strategy we adopted for region replicas.
As with cross-cluster replication, we would want to provide metrics on the depth of the queues and the maximum age of entries in these queues, so operators can get a sense of how far behind they might be and determine compliance with application service level objectives.
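A minimal sketch of the two metrics named above, queue depth and maximum entry age, might look like the following. The class `QueueLagMetrics` and its method names are hypothetical and not part of any existing HBase metrics API; the sketch only shows the bookkeeping, keyed by the time each edit was enqueued.

```java
import java.util.TreeMap;

/** Hypothetical bookkeeping for store-and-forward queue depth and max age. */
final class QueueLagMetrics {
    // Enqueue time (ms) -> number of pending edits enqueued at that time.
    private final TreeMap<Long, Integer> pendingByEnqueueMs = new TreeMap<>();
    private int depth = 0;

    synchronized void onEnqueue(long enqueueMs) {
        pendingByEnqueueMs.merge(enqueueMs, 1, Integer::sum);
        depth++;
    }

    /** Call once an edit enqueued at enqueueMs has been applied at its region. */
    synchronized void onApplied(long enqueueMs) {
        Integer count = pendingByEnqueueMs.get(enqueueMs);
        if (count == null) {
            throw new IllegalStateException("no pending edit enqueued at " + enqueueMs);
        }
        if (count == 1) {
            pendingByEnqueueMs.remove(enqueueMs);
        } else {
            pendingByEnqueueMs.put(enqueueMs, count - 1);
        }
        depth--;
    }

    synchronized int depth() {
        return depth;
    }

    /** Age of the oldest pending edit, or 0 when the queue is empty. */
    synchronized long maxAgeMs(long nowMs) {
        if (pendingByEnqueueMs.isEmpty()) {
            return 0;
        }
        return Math.max(0, nowMs - pendingByEnqueueMs.firstKey());
    }

    /** True when the oldest pending edit is no older than the given objective. */
    synchronized boolean withinObjective(long nowMs, long maxAgeObjectiveMs) {
        return maxAgeMs(nowMs) <= maxAgeObjectiveMs;
    }
}
```

For the use case described earlier, an operator might alert when `withinObjective(now, 20 * 60 * 1000L)` turns false, since 20 minutes is the upper end of the worst-case visibility delay that service said it could tolerate.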
Implementation of this feature should satisfy compatibility policy constraints such that minor releases can accept it. At the very least we would require it in a new branch-1 minor. This is a hard requirement.
Concerns about ordering of value versions and application of operation precedence rules (e.g. deletes before puts) within a single clock tick can be mitigated by an additional, relatively small change: We can ensure that only one operation per row can be committed per clock tick. (The row is our unit of atomicity and the scope where monotonicity in timestamp assignment will be useful.) In the normal case the overhead is only an extra long for tracking the last commit time. We already need to get the current time for other reasons. In the edge case we must spin until the clock ticks over. The impact on throughput and CPU will depend on how often we hit this case, but it is expected to be rare. Operators should also ensure, through monitoring of clock drift over the fleet, that no clock is so far ahead that region failover from one server to another will defeat the monotonicity gained by this strategy.
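The one-commit-per-row-per-tick rule above can be sketched as below. This is a hedged illustration only: `RowTickClock` is a hypothetical name, the per-row map is an assumption made so the sketch is self-contained (a real implementation would presumably keep the last commit time alongside state it already holds under the row lock), and the clock is injected so the edge case can be exercised. `Thread.onSpinWait()` requires Java 9 or later.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch: at most one commit per row per clock tick. */
final class RowTickClock {
    private final ConcurrentHashMap<String, Long> lastCommitMs = new ConcurrentHashMap<>();
    private final LongSupplier clock;

    RowTickClock(LongSupplier clock) {
        this.clock = clock;
    }

    /**
     * Returns a commit timestamp strictly greater than the previous one issued
     * for this row. In the normal case this is a single clock read; in the
     * edge case it spins until the clock ticks over.
     */
    long nextCommitTimestamp(String row) {
        // compute() runs atomically per key, standing in for the row lock.
        return lastCommitMs.compute(row, (r, last) -> {
            long now = clock.getAsLong();
            while (last != null && now <= last) {
                Thread.onSpinWait();
                now = clock.getAsLong();
            }
            return now;
        });
    }
}
```

In production use the supplier would be something like `System::currentTimeMillis`. Note that the spin also waits out a clock that has stepped backwards, which is one reason the clock drift monitoring mentioned above matters: a row whose last commit came from a server with a fast clock would stall on its new server until that server's clock caught up.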