Level compactions in HBase - design and code notes.
***** Introduction.
Familiarity with the LevelDB compaction design is assumed.
The "range" below refers to the key range of a store file unless otherwise specified.
***** General notes (not just for level compaction).
*** How to make things swappable.
Level compactions require a different store file structure compared to the current one - there is no way to load level-compacted files into a linear list of files and use them via the old code; therefore, a lot of store-related things will have to be pluggable to make level compaction possible. On the other hand, different compaction algorithms like tiered compaction may replace just the file selection policy, keeping the store file structure intact; yet others may replace the compactor to optimize it, or whatever.
Therefore, the suggested way to make compactions pluggable is as follows. There will be just one pluggable class, called StorageEngine (name open to debate). It will serve merely as a factory for objects of several interface types, such as StoreFileManager (not the same as HStore!), Compactor, CompactionPolicy, and maybe others. Each of these objects will be independent in the sense that implementations will be usable in different combinations, with the StorageEngine author being responsible for producing a valid one.
By default, StorageEngine will contain the normal store file management, the default compactor, and the default policy. Level compaction will replace all of these; tiered compaction will replace just the policy, etc.
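A minimal Java sketch of this arrangement, to make the factory shape concrete. Only StorageEngine, StoreFileManager, Compactor and CompactionPolicy come from the notes above; the rest of the names are hypothetical, and the real interfaces would of course carry HBase-specific methods and types:

  // Components; the real interfaces would carry HBase-specific methods.
  interface StoreFileManager {}
  interface Compactor {}
  interface CompactionPolicy {}

  class DefaultStoreFileManager implements StoreFileManager {}
  class DefaultCompactor implements Compactor {}
  class DefaultCompactionPolicy implements CompactionPolicy {}
  class LevelStoreFileManager implements StoreFileManager {}
  class LevelCompactor implements Compactor {}
  class LevelCompactionPolicy implements CompactionPolicy {}
  class TieredCompactionPolicy implements CompactionPolicy {}

  // The single pluggable class; it only manufactures the components,
  // and each engine is responsible for producing a valid combination.
  abstract class StorageEngine {
    abstract StoreFileManager createStoreFileManager();
    abstract Compactor createCompactor();
    abstract CompactionPolicy createCompactionPolicy();
  }

  // Default engine: the current flat file list plus the default
  // compactor and policy.
  class DefaultStorageEngine extends StorageEngine {
    StoreFileManager createStoreFileManager() { return new DefaultStoreFileManager(); }
    Compactor createCompactor() { return new DefaultCompactor(); }
    CompactionPolicy createCompactionPolicy() { return new DefaultCompactionPolicy(); }
  }

  // Level compaction swaps all three components...
  class LevelStorageEngine extends StorageEngine {
    StoreFileManager createStoreFileManager() { return new LevelStoreFileManager(); }
    Compactor createCompactor() { return new LevelCompactor(); }
    CompactionPolicy createCompactionPolicy() { return new LevelCompactionPolicy(); }
  }

  // ...while tiered compaction reuses the default structure and
  // replaces only the selection policy.
  class TieredStorageEngine extends DefaultStorageEngine {
    @Override
    CompactionPolicy createCompactionPolicy() { return new TieredCompactionPolicy(); }
  }

This keeps the rest of the store code oblivious to the file organization: it asks its engine for the components and never hard-codes a particular combination.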
*** Migration.
How do we figure out what file structure we are dealing with?
TBD (I have it on a piece of paper :)).
***** Level compaction in HBase notes.
*** File structure.
For the first cut, we can probably avoid the manifest file from the original description, since we already have metadata in store files. However, see "Other random notes".
*** Metadata changes and file ordering.
Each file will gain new metadata fields - the level, and the first and last row key. The latter two are there for convenience; they can be derived.
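For illustration, a sketch of what this metadata could look like; the key names are hypothetical, and in the real store these would be entries alongside the existing store file metadata:

  import java.nio.charset.StandardCharsets;
  import java.util.HashMap;
  import java.util.Map;

  class LevelFileMetadata {
    // Hypothetical metadata key names.
    static final byte[] LEVEL = "LEVEL".getBytes(StandardCharsets.UTF_8);
    static final byte[] FIRST_ROW_KEY = "FIRST_ROW_KEY".getBytes(StandardCharsets.UTF_8);
    static final byte[] LAST_ROW_KEY = "LAST_ROW_KEY".getBytes(StandardCharsets.UTF_8);

    // First/last row key are derivable by reading the file; storing
    // them lets us place a file into the level structure without
    // scanning it.
    static Map<byte[], byte[]> forNewFile(int level, byte[] firstRow, byte[] lastRow) {
      Map<byte[], byte[]> info = new HashMap<>();
      info.put(LEVEL, Integer.toString(level).getBytes(StandardCharsets.UTF_8));
      info.put(FIRST_ROW_KEY, firstRow);
      info.put(LAST_ROW_KEY, lastRow);
      return info;
    }
  }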
Also, we can no longer rely on store file seqNums for sorting files by age for a particular request, despite the data moving strictly up the levels.
Why? Suppose we compact across levels Ln and Ln+1 and add files into Ln+1. By definition, the aggregate range of the files coming from Ln (call it Rn) will be a sub-range of the aggregate range of the Ln+1 files involved (Rn+1). Also, the ranges of the newly written Ln+1-level files intersect the ranges of the older Ln+1-level files arbitrarily. Therefore, some of the new files may contain a combination of data from Rn (coming from Ln and Ln+1 files) and data from (Rn+1 minus Rn) (coming only from Ln+1 files). For such a file, the max seqNum may come from an Ln file; this seqNum has no invariants with regard to the seqNums of other files still in Ln. However, there may be data remaining in Ln that belongs to (Rn+1 minus Rn). Thus, we have two files with data from the same range that are logically ordered by age, but the relation of their max seqNums is not determined.
This problem doesn't arise for Rn itself because Ln has zero keys from Rn at the end of the compaction. Therefore we can avoid this issue and have files strictly ordered by seqNum in such a manner that the age ordering is valid for any key; we can either ensure that (Rn+1 minus Rn) is empty, which is unworkable, or ensure that the "mixed" files do not exist, e.g. that both boundaries of Rn become range boundaries for some Ln+1 files (see also "Other random notes").
Thus, the files will be stored and selected by level. There are some simple optimizations possible to avoid enumerating files/traversing this structure. The exception is L0, where file ranges can intersect; L0 files would additionally be sorted by seqNum.
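A rough sketch of this structure, with a simplified stand-in for store files (String keys instead of byte[] and comparators):

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;

  // Simplified stand-in for a store file.
  class FileInfo {
    final String firstKey, lastKey; // range is [firstKey, lastKey)
    final long maxSeqNum;
    FileInfo(String first, String last, long seqNum) {
      firstKey = first; lastKey = last; maxSeqNum = seqNum;
    }
  }

  class LevelFiles {
    private final List<List<FileInfo>> levels = new ArrayList<>();

    void add(int level, FileInfo f) {
      while (levels.size() <= level) levels.add(new ArrayList<FileInfo>());
      levels.get(level).add(f);
      if (level == 0) {
        // L0 ranges may intersect, so order by age (newest first).
        levels.get(0).sort(
            Comparator.comparingLong((FileInfo x) -> x.maxSeqNum).reversed());
      } else {
        // Other levels hold disjoint ranges; order by start key, which
        // also allows binary search instead of enumerating the level.
        levels.get(level).sort(Comparator.comparing(x -> x.firstKey));
      }
    }
  }

The disjoint-range invariant on levels other than L0 is what enables the lookup optimizations; L0 must always be examined in full.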
*** Migration.
How do we migrate from/to level-based structure with minimum pain?
TBD (I have it on a piece of paper :)).
*** Store file management changes (hints at StoreFileManager boundaries).
* Adding files after flush or bulk load - files will be added to L0 with the highest seqNum.
* Splicing files in after compaction - differs from the current code in a straightforward manner.
* Loading files - if we don't have the manifest file, the changes are straightforward - files inserted into new structure, etc. For migration between structures, see "Migration".
* Compaction. All of the compaction logic from HStore will obviously have to go into a swappable class - the default parts of the compaction policy, initial candidate selection, the way parallel compactions are restricted, etc. All of these are more or less straightforward for level compaction.
* Finding the split point. The "middle of the biggest file" heuristic is no longer available. One simple heuristic could be: build the expected size distribution by taking all segments of the key space (e.g. dividing it into small chunks by the range boundaries of all files) and estimating the contribution of each file to each segment by assuming linear data distribution within the file; then find the middle by weight (see the split-point sketch after this list).
* Getting the key before a key (K) - becomes more complicated, as follows (see the search sketch after this list):
1) As before, we sort the files, and start looking for Cn <= K in the files one by one.
2) Once found and if Cn < K, we have to check the remaining files, updating Cn as we find better candidates (Cn < Cn+1 <= K). We can restrict the search to files whose ranges intersect (Cn, K].
3) The complexity arises if we don't find Cn in any file with range [Ai, Bi) where (Ai <= K < Bi). There is no strict method to order all the files [Ai, Bi) where Bi <= K for finding the next best Cn, other than some heuristic.
4) For the case (3), once we find Cn we have to do the same check as (2), potentially involving multiple files from each level (to be searched in reverse key order within each level).
* Get request file selection. Trivial - all the files where the key is in range, ordered by level, then by seqNum (for level 0); see the selection sketch after this list.
* Scan request file selection - all the files with overlapping ranges (also in the selection sketch below). Seems like it should not change KeyValueHeap/etc., but we need to check how it handles multiple values, e.g. does it rely on seqNums for age? TBD.
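The split-point sketch mentioned above; a minimal illustration under the simplifying assumption that row keys map onto a numeric axis (real code would approximate positions over byte[] keys), and that the input is non-empty:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.TreeSet;

  class SplitPoint {
    // Simplified stand-in for a store file: key range on a numeric axis.
    static class FileRange {
      final double start, end;
      final long size;
      FileRange(double start, double end, long size) {
        this.start = start; this.end = end; this.size = size;
      }
    }

    // Cut the key space into segments at every file boundary, credit
    // each file's size to segments proportionally to overlap (i.e.
    // assuming linear data distribution within the file), then walk
    // to the point covering half of the total weight.
    static double weightedMiddle(List<FileRange> files) {
      TreeSet<Double> boundarySet = new TreeSet<>();
      for (FileRange f : files) {
        boundarySet.add(f.start);
        boundarySet.add(f.end);
      }
      List<Double> b = new ArrayList<>(boundarySet);

      int segments = b.size() - 1;
      double[] weight = new double[segments];
      double total = 0;
      for (int i = 0; i < segments; i++) {
        for (FileRange f : files) {
          double overlap = Math.min(b.get(i + 1), f.end) - Math.max(b.get(i), f.start);
          if (overlap > 0) {
            weight[i] += f.size * overlap / (f.end - f.start);
          }
        }
        total += weight[i];
      }

      double seen = 0;
      for (int i = 0; i < segments; i++) {
        if (weight[i] > 0 && seen + weight[i] >= total / 2) {
          // Interpolate inside the segment to the exact half-weight point.
          double needed = total / 2 - seen;
          return b.get(i) + (b.get(i + 1) - b.get(i)) * (needed / weight[i]);
        }
        seen += weight[i];
      }
      return b.get(b.size() - 1); // degenerate case, e.g. all sizes zero
    }
  }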
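The search sketch for "getting the key before a key", covering steps (1)-(2); largestKeyAtMost() is a hypothetical stand-in for an actual store file seek, and the heuristic file ordering needed for step (3) is left out:

  import java.util.List;

  class RowBefore {
    // Hypothetical per-file view.
    interface LevelFile {
      String firstKey();                  // inclusive range start
      String lastKey();                   // exclusive range end
      String largestKeyAtMost(String k);  // null if no key <= k in the file
    }

    // Scan the files, but skip any file whose range cannot intersect
    // (candidate, K], i.e. cannot improve the current candidate.
    static String keyBefore(List<LevelFile> files, String k) {
      String candidate = null;
      for (LevelFile f : files) {
        if (f.firstKey().compareTo(k) > 0) continue; // entirely above K
        if (candidate != null && f.lastKey().compareTo(candidate) <= 0) {
          continue; // all keys below the current candidate
        }
        String c = f.largestKeyAtMost(k);
        if (c != null && (candidate == null || c.compareTo(candidate) > 0)) {
          candidate = c;
          if (candidate.equals(k)) break; // cannot improve on K itself
        }
      }
      return candidate;
    }
  }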
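And the selection sketch for gets and scans, again over simplified String-keyed files. The level list is assumed to hold L0 first (sorted by seqNum, newest first) and the other levels sorted by start key, so iterating level by level already yields the order a get needs:

  import java.util.ArrayList;
  import java.util.List;

  class ReadSelection {
    static class FileInfo {
      final String firstKey, lastKey; // [firstKey, lastKey)
      FileInfo(String first, String last) { firstKey = first; lastKey = last; }
      boolean contains(String key) {
        return firstKey.compareTo(key) <= 0 && key.compareTo(lastKey) < 0;
      }
      boolean overlaps(String start, String stop) { // scan range [start, stop)
        return firstKey.compareTo(stop) < 0 && start.compareTo(lastKey) < 0;
      }
    }

    // Get: every file whose range contains the key. Linear scan per
    // level for clarity; the sorted, disjoint levels allow binary search.
    static List<FileInfo> filesForGet(List<List<FileInfo>> levels, String key) {
      List<FileInfo> result = new ArrayList<>();
      for (List<FileInfo> level : levels)
        for (FileInfo f : level)
          if (f.contains(key)) result.add(f);
      return result;
    }

    // Scan: every file whose range overlaps the scan range.
    static List<FileInfo> filesForScan(List<List<FileInfo>> levels,
                                       String start, String stop) {
      List<FileInfo> result = new ArrayList<>();
      for (List<FileInfo> level : levels)
        for (FileInfo f : level)
          if (f.overlaps(start, stop)) result.add(f);
      return result;
    }
  }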
*** Other random notes.
* One algorithm change we can consider that will:
- allow sorting the files by seqNum;
- reduce I/O;
- increase storage space usage.
is as follows. When we compact the files from Ln and Ln+1, we can keep some files on Ln+1, reducing their "effective" range. E.g. if on Ln we do compaction for a file with range [4, 13), and on Ln+1 we have 3 matching files, [0, 5), [5, 10) and [10, 15), we can write out new files covering [4, 13) on Ln+1, but instead of rewriting the data from the [0, 5) and [10, 15) files into new files, we can keep them and note that they only have useful data in the [0, 4) and [13, 15) ranges respectively (see the trimming sketch below). That will require preserving these ranges somewhere - either in a manifest, or in new file metadata. Both will complicate loading store files, and a manifest will also complicate store file management.
* Another, simpler thing we can do to reduce I/O during compaction of Ln and Ln+1 is taking multiple Ln files. E.g. if we have an Ln+1 file with range [0, 5) which gets picked up for compaction with the Ln file [0, 2), we may notice that there's another Ln file with range [2, 4) (i.e. strictly within [0, 5)) and pick it up too (see the expansion sketch below).
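A sketch of the effective-range trimming for the first note; all names are hypothetical, keys are String stand-ins for byte[] rows, and persisting the trimmed ranges (manifest vs. new metadata) is left out:

  class EffectiveRange {
    static class FileRef {
      final String physicalFirst, physicalLast; // full on-disk range
      String effectiveFirst, effectiveLast;     // readable sub-range
      FileRef(String first, String last) {
        physicalFirst = effectiveFirst = first;
        physicalLast = effectiveLast = last;
      }
    }

    // After compacting the range [cFirst, cLast) into Ln+1, an edge
    // file that straddles a boundary of that range is kept as-is;
    // only its effective range shrinks to the part outside the
    // compacted range.
    static void trimAfterCompaction(FileRef edgeFile, String cFirst, String cLast) {
      if (edgeFile.effectiveFirst.compareTo(cFirst) < 0) {
        // Left edge, e.g. [0, 5) against [4, 13): keeps [0, 4).
        edgeFile.effectiveLast = cFirst;
      } else {
        // Right edge, e.g. [10, 15) against [4, 13): keeps [13, 15).
        edgeFile.effectiveFirst = cLast;
      }
    }
  }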
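And a sketch of the selection expansion from the second note; extraLnFiles() is a hypothetical helper and assumes the aggregate range of the already-selected Ln+1 files is known:

  import java.util.ArrayList;
  import java.util.List;

  class ExpandSelection {
    static class FileRange {
      final String first, last; // [first, last)
      FileRange(String first, String last) { this.first = first; this.last = last; }
    }

    // 'lnFiles': remaining (non-overlapping) Ln files; [ln1First,
    // ln1Last) is the aggregate range of the Ln+1 files already picked.
    static List<FileRange> extraLnFiles(List<FileRange> lnFiles,
                                        String ln1First, String ln1Last) {
      List<FileRange> extra = new ArrayList<>();
      for (FileRange f : lnFiles) {
        // Strictly within the Ln+1 range: compacting this file too
        // adds no new Ln+1 files to read or rewrite.
        if (ln1First.compareTo(f.first) <= 0 && f.last.compareTo(ln1Last) <= 0) {
          extra.add(f);
        }
      }
      return extra;
    }
  }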