[HBASE-4218] Data Block Encoding of KeyValues (aka delta encoding / prefix compression) - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.94.0
Fix Version/s: 0.94.0
Component/s: io
Labels:
- compression

Hadoop Flags: Reviewed
Release Note:

Adds a block compression that stores only the diff from the previous key. Good for datasets with big keys and small values. Writing and scanning become slower, but because blocks compressed with this feature stay compressed in memory in the block cache, more data is cached. Off by default (DATA_BLOCK_ENCODING=NONE on the column descriptor). To enable, set DATA_BLOCK_ENCODING to PREFIX, DIFF or FAST_DIFF on the column descriptor. Set ENCODE_ON_DISK to true on the column descriptor to keep the encoding in place in the HFile on disk as well (on by default).

Description

A compression for keys. Keys are sorted in an HFile and are usually very similar to their neighbours. Because of that, it is possible to design better compression than general-purpose algorithms provide.

It is an additional step designed to be used in memory. It aims to save memory in the cache as well as to speed up seeks within HFileBlocks. It should improve performance a lot if key lengths are larger than value lengths. For example, it makes a lot of sense to use it when the value is a counter.
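To make the idea concrete, here is a minimal sketch of prefix encoding over sorted keys. It is only an illustration: the class name `PrefixEncodingSketch`, the `Entry` type and the sample keys are hypothetical, and this is not the format implemented by the patches on this issue.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of prefix encoding; not the HBase block format. */
final class PrefixEncodingSketch {

    /** One encoded key: bytes shared with the previous key, plus the remainder. */
    static final class Entry {
        final int commonPrefix;
        final byte[] suffix;

        Entry(int commonPrefix, byte[] suffix) {
            this.commonPrefix = commonPrefix;
            this.suffix = suffix;
        }
    }

    /** Number of leading bytes that a and b have in common. */
    static int commonPrefixLength(byte[] a, byte[] b) {
        int max = Math.min(a.length, b.length);
        int i = 0;
        while (i < max && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    /** Stores each key as (shared prefix length, differing suffix). */
    static List<Entry> encode(List<byte[]> sortedKeys) {
        List<Entry> out = new ArrayList<>();
        byte[] previous = new byte[0];
        for (byte[] key : sortedKeys) {
            int common = commonPrefixLength(previous, key);
            out.add(new Entry(common, Arrays.copyOfRange(key, common, key.length)));
            previous = key;
        }
        return out;
    }

    /** Rebuilds the full keys by copying the shared prefix from the previous key. */
    static List<byte[]> decode(List<Entry> entries) {
        List<byte[]> out = new ArrayList<>();
        byte[] previous = new byte[0];
        for (Entry e : entries) {
            byte[] key = new byte[e.commonPrefix + e.suffix.length];
            System.arraycopy(previous, 0, key, 0, e.commonPrefix);
            System.arraycopy(e.suffix, 0, key, e.commonPrefix, e.suffix.length);
            out.add(key);
            previous = key;
        }
        return out;
    }

    public static void main(String[] args) {
        List<byte[]> keys = Arrays.asList(
            "row0001/cf:a".getBytes(StandardCharsets.UTF_8),
            "row0001/cf:b".getBytes(StandardCharsets.UTF_8),
            "row0002/cf:a".getBytes(StandardCharsets.UTF_8));
        for (Entry e : encode(keys)) {
            System.out.println(e.commonPrefix + " " + new String(e.suffix, StandardCharsets.UTF_8));
        }
    }
}
```

With these sample keys, the second key shares an 11-byte prefix with the first and stores a single byte, which is why long, similar keys benefit the most.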

Initial tests on real data (key length = ~90 bytes, value length = 8 bytes) show that I could achieve a decent level of compression:
key compression ratio: 92%
total compression ratio: 85%
LZO on the same data: 85%
LZO after delta encoding: 91%
This comes with much better performance (20-80% faster decompression than LZO). Moreover, it should allow far more efficient seeking, which should improve performance a bit.
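As a rough sanity check on the figures above, the following sketch relates the key compression ratio to the total ratio. It assumes that "compression ratio" means the fraction of space saved, that values are stored unchanged, and that per-entry overhead is ignored; none of these assumptions are stated in the issue, and the class name is hypothetical.

```java
/** Hypothetical back-of-the-envelope check; the assumptions are listed above. */
final class CompressionRatioSketch {

    /**
     * Fraction of total space saved when only the key part shrinks
     * by keySavings and the value is stored as-is.
     */
    static double totalSavings(double keyBytes, double valueBytes, double keySavings) {
        double encoded = keyBytes * (1.0 - keySavings) + valueBytes;
        return 1.0 - encoded / (keyBytes + valueBytes);
    }

    public static void main(String[] args) {
        // 90-byte keys, 8-byte values, 92% saved on keys.
        System.out.println(totalSavings(90, 8, 0.92));
    }
}
```

Under those assumptions the result is about 0.845, which is consistent with the reported total of 85%.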

It seems that simple compression algorithms are good enough. Most of the savings come from prefix compression, int128 encoding, timestamp diffs and bitfields to avoid duplication. That way, comparisons of compressed data can be much faster than a byte comparator (thanks to prefix compression and bitfields).
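The timestamp-diff idea can be sketched as follows. This example stores each timestamp as the difference from the previous one, written as a zigzag base-128 variable-length integer. Reading "int128 encoding" as base-128 varints is an assumption, and the class and method names are hypothetical; the actual patches may encode timestamps differently.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of timestamp deltas; not the format used by the patches. */
final class TimestampDeltaSketch {

    /** Writes value as a zigzag varint: 7 payload bits per byte, high bit = "more follows". */
    static void writeVarLong(ByteArrayOutputStream out, long value) {
        long zz = (value << 1) ^ (value >> 63); // zigzag keeps small negative deltas small
        while ((zz & ~0x7FL) != 0) {
            out.write((int) ((zz & 0x7F) | 0x80));
            zz >>>= 7;
        }
        out.write((int) zz);
    }

    /** Encodes each timestamp as the delta from its predecessor (the first one from 0). */
    static byte[] encode(long[] timestamps) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long ts : timestamps) {
            writeVarLong(out, ts - previous);
            previous = ts;
        }
        return out.toByteArray();
    }

    /** Reverses encode(); count is the number of timestamps that were written. */
    static long[] decode(byte[] data, int count) {
        long[] result = new long[count];
        int pos = 0;
        long previous = 0;
        for (int i = 0; i < count; i++) {
            long zz = 0;
            int shift = 0;
            int b;
            do {
                b = data[pos++] & 0xFF;
                zz |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += (zz >>> 1) ^ -(zz & 1); // undo zigzag, then undo the delta
            result[i] = previous;
        }
        return result;
    }
}
```

Timestamps that are close together then take one or two bytes each instead of a fixed eight, which is where the saving comes from when neighbouring KeyValues were written at nearly the same time.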

In order to implement it in HBase, two important design changes will be needed:
- solidify the interface to HFileBlock / HFileReader Scanner to provide seeking and iterating; access to the uncompressed buffer in HFileBlock will have bad performance
- extend comparators to support comparison assuming that the first N bytes are equal (or some fields are equal)
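The second change can be illustrated with a small sketch of a comparator that skips bytes already known to be equal. The class and method names are hypothetical and do not correspond to any HBase comparator API. The caller is assumed to guarantee that the first `knownEqual` bytes of both arrays really match and that `knownEqual` does not exceed either length.

```java
/** Hypothetical sketch of a prefix-aware comparison; not an HBase class. */
final class PrefixAwareComparatorSketch {

    /**
     * Lexicographic comparison of unsigned bytes that starts at offset
     * knownEqual instead of 0. The caller guarantees the skipped bytes match.
     */
    static int compareFrom(byte[] left, byte[] right, int knownEqual) {
        int min = Math.min(left.length, right.length);
        for (int i = knownEqual; i < min; i++) {
            int a = left[i] & 0xFF;
            int b = right[i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        // All compared bytes match: the shorter array sorts first.
        return left.length - right.length;
    }
}
```

With prefix-encoded data the shared prefix length is stored next to each key, so a scanner could pass it in as `knownEqual` and avoid re-reading the common bytes on every comparison.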

Link to a discussion about something similar:
http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

Attachments

- open-source.diff (08/Oct/11 00:55, 340 kB, Jacek Migdal)
- Delta-encoding-2012-01-25_16_32_14.patch (26/Jan/12 00:32, 514 kB, Mikhail Gryzykhin)
- Delta-encoding-2012-01-25_00_45_29.patch (25/Jan/12 08:48, 513 kB, Mikhail Gryzykhin)
- Delta-encoding-2012-01-17_11_09_09.patch (17/Jan/12 19:09, 499 kB, Mikhail Gryzykhin)
- Delta-encoding.patch-2012-01-13_12_20_07.patch (13/Jan/12 20:20, 464 kB, Mikhail Gryzykhin)
- Delta-encoding.patch-2012-01-07_14_12_48.patch (07/Jan/12 22:13, 444 kB, Mikhail Gryzykhin)
- Delta-encoding.patch-2012-01-05_18_50_47.patch (06/Jan/12 02:52, 444 kB, Mikhail Gryzykhin)
- Delta-encoding.patch-2012-01-05_16_31_44.patch (06/Jan/12 00:32, 439 kB, Mikhail Gryzykhin)
- Delta-encoding.patch-2012-01-05_16_31_44_copy.patch (06/Jan/12 01:58, 439 kB, Mikhail Gryzykhin)
- Delta-encoding.patch-2012-01-05_15_16_43.patch (05/Jan/12 23:16, 439 kB, Mikhail Gryzykhin)
- Delta-encoding.patch-2011-12-22_11_52_07.patch (22/Dec/11 19:52, 409 kB, Mikhail Gryzykhin)
- Delta_encoding_with_memstore_TS.patch (29/Nov/11 02:07, 376 kB, Mikhail Gryzykhin)
- Data-block-encoding-2011-12-23.patch (23/Dec/11 22:47, 409 kB, Ted Yu)
- ASF.LICENSE.NOT.GRANTED--D447.9.patch (21/Dec/11 01:47, 372 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.8.patch (14/Dec/11 01:50, 370 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.7.patch (13/Dec/11 02:59, 360 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.6.patch (12/Dec/11 19:59, 359 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.5.patch (29/Nov/11 01:41, 357 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.4.patch (22/Nov/11 20:38, 327 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.3.patch (15/Nov/11 23:47, 358 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.26.patch (26/Jan/12 00:29, 487 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.25.patch (25/Jan/12 02:46, 486 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.24.patch (17/Jan/12 18:52, 473 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.23.patch (14/Jan/12 02:18, 479 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.22.patch (13/Jan/12 20:18, 438 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.21.patch (07/Jan/12 22:20, 419 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.20.patch (06/Jan/12 02:49, 419 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.2.patch (15/Nov/11 23:27, 357 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.19.patch (06/Jan/12 00:33, 414 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.18.patch (05/Jan/12 23:13, 414 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.17.patch (03/Jan/12 19:36, 402 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.16.patch (02/Jan/12 04:29, 407 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.15.patch (29/Dec/11 03:42, 385 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.14.patch (28/Dec/11 22:59, 387 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.13.patch (22/Dec/11 19:49, 389 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.12.patch (22/Dec/11 18:51, 388 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.11.patch (22/Dec/11 02:51, 389 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.10.patch (21/Dec/11 02:47, 372 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D447.1.patch (15/Nov/11 02:00, 371 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D1659.3.patch (17/Feb/12 00:51, 471 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D1659.2.patch (10/Feb/12 02:26, 471 kB, Phabricator)
- ASF.LICENSE.NOT.GRANTED--D1659.1.patch (09/Feb/12 02:31, 469 kB, Phabricator)
- 4218-v16.txt (02/Jan/12 04:58, 407 kB, Ted Yu)
- 4218-2012-01-14.txt (14/Jan/12 16:04, 479 kB, Ted Yu)
- 4218.txt (04/Jan/12 05:03, 402 kB, Ted Yu)
- 0001-Delta-encoding-fixed-encoded-scanners.patch (13/Dec/11 02:59, 379 kB, Mikhail Gryzykhin)
- 0001-Delta-encoding.patch (22/Dec/11 02:53, 409 kB, Mikhail Gryzykhin)

Issue Links

relates to

HBASE-14323 Encoding rpc payload instead of compression

Closed