[HBASE-4676] Prefix Compression - Trie data block encoding - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.95.2
Fix Version/s: 0.95.0
Component/s: io, Performance, regionserver
Labels:
None

Hadoop Flags:

Reviewed
Release Note:

Hide
Experimental: please use on non-critical data for now. PREFIX_TREE encoder stores row keys in a trie layout and uses other encoding techniques for the cells inside a row. It provides similar memory savings to existing encoders, but has faster random access at a cost of slower encoding speed. Set encoding => PREFIX_TREE to enable for a column family.

Show
Experimental: please use on non-critical data for now. PREFIX_TREE encoder stores row keys in a trie layout and uses other encoding techniques for the cells inside a row. It provides similar memory savings to existing encoders, but has faster random access at a cost of slower encoding speed. Set encoding => PREFIX_TREE to enable for a column family.
Tags:
0.96notable

Description

The HBase data block format has room for 2 significant improvements for applications that have high block cache hit ratios.

First, there is no prefix compression, and the current KeyValue format is somewhat metadata heavy, so there can be tremendous memory bloat for many common data layouts, specifically those with long keys and short values.

Second, there is no random access to KeyValues inside data blocks. This means that every time you double the datablock size, average seek time (or average cpu consumption) goes up by a factor of 2. The standard 64KB block size is ~10x slower for random seeks than a 4KB block size, but block sizes as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB or more may be more efficient from a disk access and block-cache perspective in many big-data applications, but doing so is infeasible from a random seek perspective.

The PrefixTrie block encoding format attempts to solve both of these problems. Some features:

trie format for row key encoding completely eliminates duplicate row keys and encodes similar row keys into a standard trie structure which also saves a lot of space
the column family is currently stored once at the beginning of each block. this could easily be modified to allow multiple family names per block
all qualifiers in the block are stored in their own trie format which caters nicely to wide rows. duplicate qualifers between rows are eliminated. the size of this trie determines the width of the block's qualifier fixed-width-int
the minimum timestamp is stored at the beginning of the block, and deltas are calculated from that. the maximum delta determines the width of the block's timestamp fixed-width-int

The block is structured with metadata at the beginning, then a section for the row trie, then the column trie, then the timestamp deltas, and then then all the values. Most work is done in the row trie, where every leaf node (corresponding to a row) contains a list of offsets/references corresponding to the cells in that row. Each cell is fixed-width to enable binary searching and is represented by [1 byte operationType, X bytes qualifier offset, X bytes timestamp delta offset].

If all operation types are the same for a block, there will be zero per-cell overhead. Same for timestamps. Same for qualifiers when i get a chance. So, the compression aspect is very strong, but makes a few small sacrifices on VarInt size to enable faster binary searches in trie fan-out nodes.

A more compressed but slower version might build on this by also applying further (suffix, etc) compression on the trie nodes at the cost of slower write speed. Even further compression could be obtained by using all VInts instead of FInts with a sacrifice on random seek speed (though not huge).

One current drawback is the current write speed. While programmed with good constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not programmed with the same level of optimization as the read path. Work will need to be done to optimize the data structures used for encoding and could probably show a 10x increase. It will still be slower than delta encoding, but with a much higher decode speed. I have not yet created a thorough benchmark for write speed nor sequential read speed.

Though the trie is reaching a point where it is internally very efficient (probably within half or a quarter of its max read speed) the way that hbase currently uses it is far from optimal. The KeyValueScanner and related classes that iterate through the trie will eventually need to be smarter and have methods to do things like skipping to the next row of results without scanning every cell in between. When that is accomplished it will also allow much faster compactions because the full row key will not have to be compared as often as it is now.

Current code is on github. The trie code is in a separate project than the slightly modified hbase. There is an hbase project there as well with the DeltaEncoding patch applied, and it builds on top of that.

https://github.com/hotpads/hbase/tree/prefix-trie-1
https://github.com/hotpads/hbase-prefix-trie/tree/hcell-scanners

I'll follow up later with more implementation ideas.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

4676-prefix-tree-trunk-v16.patch
06/Feb/13 18:12
398 kB
Ted Yu
4676-prefix-tree-trunk-v17.patch
06/Feb/13 23:26
398 kB
Ted Yu
HBASE-4676-0.94-v1.patch
11/Apr/12 02:24
9 kB
Matt Corgan
HBASE-4676-common-and-server-v8.patch
14/Nov/12 22:58
161 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v1.patch
15/Oct/12 05:05
591 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v10.patch
23/Dec/12 07:17
396 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v11.patch
23/Dec/12 20:14
396 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v12.patch
04/Jan/13 05:27
398 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v13.patch
04/Feb/13 06:34
398 kB
Michael Stack
HBASE-4676-prefix-tree-trunk-v13.patch
04/Jan/13 19:17
398 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v14.patch
04/Feb/13 07:31
398 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v15.patch
06/Feb/13 07:43
398 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v2.patch
24/Oct/12 07:00
599 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v3.patch
29/Oct/12 04:01
606 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v4.patch
09/Nov/12 06:38
581 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v5.patch
09/Nov/12 08:11
582 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v6.patch
10/Nov/12 00:37
583 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v7.patch
14/Nov/12 02:57
590 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v8.patch
10/Dec/12 05:37
489 kB
Matt Corgan
HBASE-4676-prefix-tree-trunk-v9.patch
21/Dec/12 05:56
396 kB
Matt Corgan
hbase-prefix-trie-0.1.jar
11/Apr/12 02:24
311 kB
Matt Corgan
PrefixTrie_Format_v1.pdf
14/Mar/12 19:58
320 kB
Matt Corgan
PrefixTrie_Performance_v1.pdf
14/Mar/12 20:45
59 kB
Matt Corgan
SeeksPerSec by blockSize.png
25/Oct/11 23:15
22 kB
Matt Corgan

Issue Links

incorporates

HBASE-7162 Prefix Compression - Trie data block encoding; hbase-common and hbase-server changes

Closed

is blocked by

HBASE-6162 Move KeyValue to hbase-common module

Closed

is related to

HBASE-6907 KeyValue equals and compareTo methods should match

Closed

HBASE-9943 Big linked list test fails with encoding PREFIX_TREE

Closed

HBASE-27329 Introduce prefix tree index block encoding use less space

Open

HBASE-4336 Convert source tree into maven modules

Closed

relates to

HBASE-7448 Add tags to Cell API

Closed

HBASE-5977 Usage of modules

Closed

(1 is related to, 2 relates to)

Prefix Compression - Trie data block encoding

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates