Details
Type: Brainstorming
Status: Open
Priority: Major
Resolution: Unresolved
Description
Caching in data serving remains crucial for performance. Networks are fast but not yet fast enough; RDMA may change this once it becomes more widely available. Caching layers should be resilient to crashes to avoid the cost of rewarming. In the context of HBase with its root filesystem placed on S3, the object store is quite slow relative to alternatives like HDFS, so caching is particularly essential: the rewarming cost will be high, manifesting either as client visible performance degradation (due to cache misses and reloads) or as elevated IO due to prefetching.
For cloud serving backed by S3 we expect the HBase blockcache will be configured to host the entirety of the warm set, which may be very large, so we also expect selection of the file backed option and placement of the filesystem used for cache file storage on fast local solid state devices. These devices offer data persistence beyond the lifetime of an individual process. We can take advantage of this to make block caching partially resilient to short duration process failures and restarts.
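As a rough illustration of the kind of configuration we have in mind (the device path and cache size below are assumptions for the example, not recommendations), the file backed option might be selected along these lines:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FileBackedCacheConfigExample {
  public static Configuration exampleConf() {
    Configuration conf = HBaseConfiguration.create();
    // Select the file backed bucket cache, with the backing file on a fast local SSD.
    // The mount point here is an assumed example.
    conf.set("hbase.bucketcache.ioengine", "file:/mnt/nvme0/hbase-bucketcache.data");
    // Size the cache large enough to hold the entire warm set; the value is illustrative.
    conf.set("hbase.bucketcache.size", "102400"); // MB
    // Persist the cache index so a restarted process can find the preexisting contents.
    conf.set("hbase.bucketcache.persistent.path", "/mnt/nvme0/hbase-bucketcache.index");
    return conf;
  }
}
{code}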
When the blockcache is file backed, at startup it can reinitialize and prewarm itself with a scan over preexisting disk contents. These will be cache files left behind by another process that executed earlier on the same instance. This strategy applies specifically to process restart and rolling upgrade scenarios. (The local storage may not survive an instance reboot.)
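A minimal sketch of what that startup recovery path could look like, assuming a persisted index that maps cache keys to offsets in the backing file (the class and method names here are hypothetical placeholders, not existing HBase APIs):

{code:java}
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of reloading cache metadata left behind by a previous process. */
public class FileBackedCacheRecovery {
  /** Maps a cache key (HFile name + block offset) to its location in the backing cache file. */
  private final Map<String, Long> backingMap = new ConcurrentHashMap<>();

  /**
   * Scan the preexisting cache file and its persisted index at startup. If either is
   * missing or unreadable we simply start cold, as we do today.
   */
  @SuppressWarnings("unchecked")
  public boolean recoverFromDisk(File indexFile, File dataFile) {
    if (!indexFile.exists() || !dataFile.exists()) {
      return false; // nothing left behind by a prior process; start with an empty cache
    }
    try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(indexFile))) {
      Map<String, Long> persisted = (Map<String, Long>) in.readObject();
      backingMap.putAll(persisted);
      return true;
    } catch (IOException | ClassNotFoundException e) {
      backingMap.clear();
      return false; // corrupt or stale index; fall back to a cold cache
    }
  }

  public int cachedBlockCount() {
    return backingMap.size();
  }
}
{code}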
Once the server has reloaded the blockcache metadata from local storage, it can advertise to the HMaster the list of HFiles for which it has some precached blocks resident. This implies the blockcache's file backed option should maintain a mapping of source HFile paths for the blocks in cache. We don't need to provide more granular information about which blocks of an HFile are or are not in cache: it is unlikely entries for the HFile will be cached elsewhere, and we can assume placing a region containing the HFile on a server with any of its blocks cached will be better than the alternatives.
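The regionserver-to-master report could be little more than the set of HFile paths derived from that mapping. A rough sketch (the CachedHFileReport type and its fields are assumptions for illustration, not a proposed wire format):

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of the per-server report: for each HFile path, how many of its
 * blocks were found resident in the reloaded file backed cache. Only the path set is
 * strictly required; the per-HFile counts are a possible refinement.
 */
public class CachedHFileReport {
  private final String serverName;
  private final Map<String, Integer> cachedBlocksPerHFile = new HashMap<>();

  public CachedHFileReport(String serverName) {
    this.serverName = serverName;
  }

  /** Called while walking the reloaded cache metadata; the HFile path comes from the cache key. */
  public void recordCachedBlock(String hfilePath) {
    cachedBlocksPerHFile.merge(hfilePath, 1, Integer::sum);
  }

  /** The set of HFiles for which this server has at least one precached block. */
  public Set<String> cachedHFiles() {
    return cachedBlocksPerHFile.keySet();
  }

  public String getServerName() {
    return serverName;
  }
}
{code}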
The HMaster already waits for regionserver registration activity to stabilize before assigning regions, and we can contemplate adding a configurable delay in region reassignment during server crash handling, in the hope that a restarted or recovered instance will come online and report its reloaded cache contents in time for assignment decisions to consider this new factor in data locality. When finally processing (re)assignment, the HMaster can take this additional factor into account when building the assignment plan. We already calculate an HDFS level locality metric. We can also calculate a new cache level locality metric, aggregated from regionserver reports of rewarmed cache contents. For a given region we can build a candidate set of servers reporting cached blocks for its associated HFiles, and the master can assign the region to the server with the highest weight. Otherwise we (re)assign using the HDFS locality metric as before.
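On the master side, one way the new factor could slot into candidate selection, expressed as a sketch rather than a patch against the actual assignment or balancer code (the class, the helper names, and the simple weighting are assumptions):

{code:java}
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of cache aware candidate selection. For each region we weigh servers
 * by how many of the region's HFiles they report as (partially) cached, and fall back to
 * the existing HDFS locality metric when no server reports any cached blocks.
 */
public class CacheAwareAssignmentSketch {

  /** Per-server set of HFile paths reported as having cached blocks after restart. */
  private final Map<String, Set<String>> cachedHFilesByServer;

  public CacheAwareAssignmentSketch(Map<String, Set<String>> cachedHFilesByServer) {
    this.cachedHFilesByServer = cachedHFilesByServer;
  }

  /**
   * Pick the server with the highest cache level locality for the region's HFiles,
   * or null if no candidate reports any cached blocks (the caller then falls back to
   * the existing HDFS locality based assignment).
   */
  public String chooseServer(Collection<String> regionHFiles, List<String> liveServers) {
    String best = null;
    int bestWeight = 0;
    for (String server : liveServers) {
      Set<String> cached = cachedHFilesByServer.getOrDefault(server, Collections.emptySet());
      int weight = 0;
      for (String hfile : regionHFiles) {
        if (cached.contains(hfile)) {
          weight++;
        }
      }
      if (weight > bestWeight) {
        bestWeight = weight;
        best = server;
      }
    }
    return best;
  }
}
{code}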
In this way, during rolling restarts or quick process restarts managed by a supervisory process, we are very likely to assign a region back to the server that most recently hosted it, and we can immediately reuse any file backed blockcache data the previous process accumulated for the region. These are the most common scenarios encountered during normal cluster operation. This will allow HBase's internal data caching to be resilient to short duration crashes and administrative process restarts.