Level compactions in HBase - design and code notes.
***** Introduction.
Familiarity with the LevelDB compaction design is assumed.
The "range" below refers to the key range of a store file unless otherwise specified.
***** General notes (not just for level compaction).
*** How to make things swappable.
Level compactions require a different store file structure from the current one - there is no way to load level-compacted files into a linear list of files and use them via the old code; therefore, a lot of store-related things will have to be pluggable to make level compaction possible. On the other hand, different compaction algorithms like tiered compaction may replace just the file selection policy, keeping the store file structure intact; yet others may replace only the compactor to optimize it, and so on.
Therefore, the suggested way to make compactions pluggable is as follows. There will be just one pluggable class, called StorageEngine (name open to debate). It will serve merely as a factory for objects of several interface types, such as StoreFileManager (not the same as HStore!), Compactor, CompactionPolicy, and maybe others. Each of these objects will be independent, in the sense that implementations will be usable in different combinations, with the StorageEngine author being responsible for making a valid one.
By default, StorageEngine will contain the normal store file management and the default compactor and policy. Level compaction will replace all of these; tiered compaction will replace just the policy, etc.
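For illustration only, a minimal sketch of the factory shape this implies (interface and method names below are placeholders, not a proposed API):
  // Sketch only; names and signatures are illustrative.
  interface StoreFileManager {}    // owns the store file structure: load, add, splice, select files
  interface Compactor {}           // rewrites a selected set of files
  interface CompactionPolicy {}    // decides what to compact and when
  // The single pluggable class; everything else is obtained from it.
  interface StorageEngine {
    StoreFileManager createStoreFileManager();
    Compactor createCompactor();
    CompactionPolicy createCompactionPolicy();
  }
Under this shape, a level engine would provide its own implementations of all three, while a tiered engine could reuse the default manager and compactor and supply only a different policy.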
*** Migration.
Some store file structures are incompatible with some StorageEngine-s (see the level structure "metadata changes" and "migration" sections below). When the region is loaded, the currently configured storage engine is instantiated and gets the files. To determine which store file structure was previously used for the files, we can write the storage engine class name into the files' metadata. If different files name different engines, the newer files are assumed to contain the most recent information (this covers two cases: first, when the new structure reuses files from the old structure without rewriting them; and second, the recovery case, where it is assumed that the storage engine can recognize an incomplete migration and resume it). If no such class name is present, the default file structure is assumed. Internally, a storage engine can use additional metadata; for example, the level engine only has to have its class name saved in the latest L0 files - all other files will be implicitly recognized by it, as they have the level embedded.
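For illustration, a rough sketch of that detection, assuming a hypothetical metadata key and accessor (both made up for this sketch):
  // Sketch only; StoreFileInfo and its metadata accessor are hypothetical stand-ins.
  import java.util.List;
  class EngineDetection {
    static final String ENGINE_KEY = "STORAGE_ENGINE_CLASS";     // hypothetical metadata key
    static final String DEFAULT_ENGINE = "DefaultStorageEngine"; // assumed when no key is present
    interface StoreFileInfo {
      String getMetadataValue(String key);  // null if the key is absent
    }
    // Files are examined newest-first; the first engine name found wins, so the most
    // recently written files (e.g. the latest L0 files for the level engine) determine
    // which engine owns the store.
    static String detectEngine(List<StoreFileInfo> filesNewestFirst) {
      for (StoreFileInfo f : filesNewestFirst) {
        String engine = f.getMetadataValue(ENGINE_KEY);
        if (engine != null) {
          return engine;
        }
      }
      return DEFAULT_ENGINE;  // no class name anywhere => old default structure
    }
  }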
Alternatively, a manifest with the file list and metadata could make this simpler; however, it would introduce other complexity around file management.
For the first cut, since we only have two engines, they will have to be able to recognize and convert to and from each other explicitly, or fail if the structure is not recognized.
Later (hand-wavy here), we may add major-compaction-based conversion, where both engines would be instantiated and a major compaction would be performed with readers from one and writers from the other.
***** Level compaction in HBase notes.
*** File structure.
For the first cut, we can probably avoid the manifest file from the original description, since we already have metadata in store files. However, see "Other random notes".
*** Metadata changes and file ordering.
Each file will gain new metadata fields - the level, and the first and last row key. The latter two are there for convenience; they can be derived.
Also, we can no longer rely on store file seqNums for sorting files by age for a particular request, despite the data moving strictly up the levels.
Why? Suppose we compact across levels Ln and Ln+1 and add files into Ln+1. By definition, the aggregate range of the files coming from Ln (call it Rn) will be a sub-range of the aggregate range of the Ln+1 files involved (Rn+1). Also, the ranges of the newly written Ln+1 files intersect the ranges of the older Ln+1 files arbitrarily. Therefore, some of the new files may contain a combination of data from Rn (coming from both Ln and Ln+1 files) and data from (Rn+1 minus Rn) (coming only from Ln+1 files). For such a file, the max seqNum may come from an Ln file, and that seqNum has no invariants with regard to the seqNums of other files still in Ln. However, there may be data remaining in Ln that belongs to (Rn+1 minus Rn). Thus, we end up with two files holding data from the same range that are logically ordered by age, but whose max seqNums have no determined relation.
This problem doesn't arise for Rn itself, because Ln has zero keys from Rn at the end of the compaction. Therefore we can avoid the issue and keep files strictly ordered by seqNum in such a way that, for any given key, the age ordering is valid: we can either ensure that (Rn+1 minus Rn) is empty, which is unworkable, or ensure that "mixed" files do not exist, e.g. that both boundaries of Rn become range boundaries of some Ln+1 files (see also "Other random notes").
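For a concrete (hypothetical) illustration of the problem above: suppose the compaction takes the Ln file [4, 13) with max seqNum 20 and the Ln+1 files [0, 5), [5, 10), [10, 15), and writes new Ln+1 files [0, 7) and [7, 15). The new [7, 15) file gets max seqNum 20, but its data in [13, 15) came only from the old Ln+1 files. If Ln still holds a file [13, 20) with max seqNum 15, then for keys in [13, 15) that Ln file has newer data than the new Ln+1 file, even though its max seqNum is lower.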
Thus, the files will be stored and selected by level. There are some simple optimizations possible to avoid enumerating files/traversing this structure. The exception is L0, where file ranges can intersect. They would be additionally sorted by seqNum.
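As an illustration of that ordering, a possible "newest data first" comparator over a hypothetical holder of the new metadata fields (the holder and field names are made up):
  // Sketch only; LevelStoreFile stands in for a store file plus its new level metadata.
  import java.util.Comparator;
  class LevelStoreFile {
    final int level;        // 0 holds the newest data
    final long maxSeqNum;
    LevelStoreFile(int level, long maxSeqNum) { this.level = level; this.maxSeqNum = maxSeqNum; }
  }
  // Lower level wins; within a level (only really meaningful for L0, where ranges
  // may overlap) the higher seqNum wins.
  class NewestFirstComparator implements Comparator<LevelStoreFile> {
    public int compare(LevelStoreFile a, LevelStoreFile b) {
      if (a.level != b.level) {
        return Integer.compare(a.level, b.level);
      }
      return Long.compare(b.maxSeqNum, a.maxSeqNum);
    }
  }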
*** Migration.
(see also the general Migration section above; here it's assumed we know the existing and desired structure).
Migrating the existing structure to the level structure can be accomplished by treating all existing files as L0 and performing one compaction into L1 (if we are lucky and there are files with non-intersecting ranges, we can do it in several steps by first choosing a subset of the new L0 files; but in most cases we can assume all the old files will have the full key range). This imposes the same load as doing one major compaction. This full rewrite is also relevant for the case where the level is stored in the files: if conversion is performed back and forth from level to default to level, and the default engine compacts some files in the meantime, destroying parts of the level structure, we don't want to end up reusing a mix of old and new levels during recovery. This may be simplified if there's a manifest.
Migrating level structure back, by default, will also require rewriting all data except:
1) L0 files, which are validly sorted by seqNum.
2) Files F on level L for which the following is true: for any file F2 in any Ln, where n > L, the range of F2 is either disjoint or a subset of the range of F. Sadly we wouldn't normally expect many such files.
This is to avoid the cases described in "Metadata changes" section.
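For illustration, a rough check for condition (2) above, with row keys simplified to longs and ranges treated as half-open [start, end) (real keys are byte[] and would use the store's comparator):
  // Sketch only; FileRange is a stand-in for a store file's level and key range.
  import java.util.List;
  class ReverseMigrationCheck {
    static class FileRange {
      final int level;
      final long start, end;   // [start, end)
      FileRange(int level, long start, long end) { this.level = level; this.start = start; this.end = end; }
    }
    // A file f keeps a valid seqNum-based ordering iff every file on a deeper level
    // is either disjoint from f's range or entirely contained in it.
    static boolean canKeepWithoutRewrite(FileRange f, List<FileRange> allFiles) {
      for (FileRange other : allFiles) {
        if (other.level <= f.level) continue;              // only deeper levels matter
        boolean disjoint = other.end <= f.start || other.start >= f.end;
        boolean subset = other.start >= f.start && other.end <= f.end;
        if (!disjoint && !subset) return false;            // partial overlap => must rewrite
      }
      return true;
    }
  }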
If the Ln+1 boundary trick is performed (see "Other random notes"), simple ordering of many files by seqNum can be done, with normal compactions eventually merging them. However, compactions in this case will have to be aware of the updated file ranges; and we will need to double check the existing logic for assumptions (see "Other random notes").
*** Store file management changes (hints at StoreFileManager boundaries).
* Adding files after flush/bulk load - files will be added to L0 with the highest seqNum.
* Splicing files in after compaction - different in a straightforward manner.
* Loading files - if we don't have the manifest file, the changes are straightforward - files inserted into new structure, etc. For migration between structures, see "Migration".
* Compaction. All of the compaction logic from HStore will obviously have to go into a swappable class, like the default parts of compaction policy, initial candidate selection, the way parallel compactions are restricted, etc. All of these for level compaction are more or less straightforward.
* Finding the split point. The "middle of the biggest file" heuristic is no longer available. One simple heuristic could be: build the expected size distribution by taking all segments of the key space (e.g. dividing it into small chunks by the range boundaries of all files) and estimating the contribution of each file to each segment by assuming a linear distribution within the file; then calculate the middle by weight.
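A possible sketch of that heuristic, with row keys simplified to longs and each file's data assumed to be spread linearly over its range (both are simplifications made for this sketch):
  import java.util.ArrayList;
  import java.util.List;
  import java.util.TreeSet;
  class SplitPointEstimate {
    static class FileInfo {
      final long start, end;   // key range [start, end)
      final long size;         // file size in bytes
      FileInfo(long start, long end, long size) { this.start = start; this.end = end; this.size = size; }
    }
    // Returns a boundary key with roughly half of the estimated data on each side.
    static long estimateSplitPoint(List<FileInfo> files) {
      if (files.isEmpty()) return 0L;
      TreeSet<Long> bounds = new TreeSet<>();
      for (FileInfo f : files) { bounds.add(f.start); bounds.add(f.end); }
      List<Long> b = new ArrayList<>(bounds);
      // Weight of each segment between adjacent boundaries, assuming each file
      // contributes proportionally to how much of its range the segment covers.
      double[] weight = new double[b.size() - 1];
      double total = 0;
      for (int i = 0; i + 1 < b.size(); i++) {
        long segStart = b.get(i), segEnd = b.get(i + 1);
        for (FileInfo f : files) {
          long overlap = Math.min(segEnd, f.end) - Math.max(segStart, f.start);
          if (overlap > 0) {
            weight[i] += f.size * ((double) overlap / (f.end - f.start));
          }
        }
        total += weight[i];
      }
      // Walk the segments until half of the total weight has been passed.
      double seen = 0;
      for (int i = 0; i < weight.length; i++) {
        seen += weight[i];
        if (seen >= total / 2) return b.get(i + 1);
      }
      return b.get(b.size() - 1);
    }
  }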
* Getting the key before a key (K) - becomes more complicated, as follows (a rough sketch of the first two steps appears after this list):
1) As before, we sort the files, and start looking for Cn <= K in the files one by one.
2) Once found and if Cn < K, we have to check the remaining files, updating Cn as we find better candidates (Cn < Cn+1 <= K). We can restrict the search to files whose ranges intersect (Cn, K].
3) The complexity arises if we don't find Cn in any files with range [Ai, Bi) where (Ai <= K < Bi). There is no strict method to order all the files [Ai, Bi) where Bi < K for finding next best Cn, other than some heuristic.
4) For the case (3), once we find Cn we have to do the same check as (2), potentially involving multiple files from each level (to be searched in reverse key order within each level).
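For illustration, a rough sketch of steps (1) and (2), with row keys simplified to longs and a hypothetical per-file largestKeyAtOrBefore accessor; the ordering of the input files and the fallback heuristics of (3) and (4) are left out:
  // Sketch only; SearchableFile is a hypothetical view of a store file.
  import java.util.List;
  class GetClosestRowBefore {
    interface SearchableFile {
      long startKey();
      long endKey();                        // exclusive
      Long largestKeyAtOrBefore(long k);    // null if the file has no key <= k
    }
    static Long getKeyAtOrBefore(long k, List<SearchableFile> files) {
      Long best = null;                     // current candidate Cn
      for (SearchableFile f : files) {
        // Skip files that cannot contain a candidate in (best, k].
        if (f.startKey() > k || (best != null && f.endKey() <= best)) {
          continue;
        }
        Long c = f.largestKeyAtOrBefore(k);
        if (c != null && (best == null || c > best)) {
          best = c;
          if (best == k) break;             // cannot improve on an exact match
        }
      }
      return best;
    }
  }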
* Get request file selection. Trivial - all the files where key is in range, ordered by level, then by seqNum (for level 0).
* Scan request file selection - the files picked will be all the files with ranges overlapping the scan range. There will be some changes with regard to file ordering (for example, KVScannerComparator will have to use levels, followed by seqNums, rather than just seqNums).
*** Other random notes.
* One algorithm change we can consider that will:
- allow sorting the files by seqNum (sort of, see below);
- reduce I/O;
- increase storage space.
is as follows. When we compact the files from Ln and Ln+1, we can keep some files on Ln+1, reducing their "effective" range. E.g. if on Ln we do compaction for a file with range [4, 13), and on Ln+1 we have 3 matching files, [0, 5), [5, 10) and [10, 15), we can write out new files covering [4, 13) on Ln+1; but instead of rewriting the data of the [0, 5) and [10, 15) files into new files, we can keep them and note that they only have useful data in the [0, 4) and [13, 15) ranges respectively. That will require preserving these ranges somewhere - either in a manifest, or in new file metadata. Both will complicate loading store files, and a manifest will also complicate store file management.
Also, note that this approach only guarantees the ability to compare files by seqNum /for a particular key/. In the general case, files with a higher seqNum will still potentially have older data than files with a lower seqNum.
* Another, simpler thing we can do is taking multiple Ln files to reduce I/O during the compaction of Ln and Ln+1. E.g. if we have an Ln+1 file with range [0, 5), which gets picked up for compaction with the Ln file [0, 2), we may notice that there's another Ln file with range [2, 4) (i.e. strictly within [0, 5)) and pick it up too.