diff --git pom.xml pom.xml index bb76f5c..1619618 100644 --- pom.xml +++ pom.xml @@ -804,9 +804,14 @@ true true - ../images/ + ${basedir}/src/main/docbkx/images/ ../css/freebsd_docbook.css ${basedir}/target/docbkx/book + + + + + @@ -816,9 +821,14 @@ pre-site - images/ + ${basedir}/src/main/docbkx/images/ css/freebsd_docbook.css ${basedir}/target/docbkx/ + + + + + diff --git src/main/docbkx/book.xml src/main/docbkx/book.xml index 6a34467..1f82ce2 100644 --- src/main/docbkx/book.xml +++ src/main/docbkx/book.xml @@ -4392,225 +4392,422 @@ This option should not normally be used, and it is not in -fixAll. - - - Compression In HBase<indexterm><primary>Compression</primary></indexterm> - There are a bunch of compression options in HBase. Some codecs come with java -- - e.g. gzip -- and so require no additional installations. Others require native - libraries. The native libraries may be available in your hadoop as is the case - with lz4 and it is just a matter of making sure the hadoop native .so is available - to HBase. You may have to do extra work to make the codec accessible; for example, - if the codec has an apache-incompatible license that makes it so hadoop cannot bundle - the library. - Below we - discuss what is necessary for the common codecs. Whatever codec you use, be sure - to test it is installed properly and is available on all nodes that make up your cluster. - Add any necessary operational step that will ensure checking the codec present when you - happen to add new nodes to your cluster. The - discussed below can help check the codec is properly install. - As to which codec to use, there is some helpful discussion - to be found in Documenting Guidance on compression and codecs. - + + + Compression and Data Block Encoding In + HBase<indexterm><primary>Compression</primary><secondary>Data Block + Encoding</secondary><seealso>codecs</seealso></indexterm> + Some of the information in this section is pulled from a discussion on the + HBase Development mailing list. + HBase supports several different compression algorithms which can be enabled on a + ColumnFamily. Data block encoding attempts to limit duplication of information in keys, taking + advantage of some of the fundamental designs and patterns of HBase, such as sorted row keys + and the schema of a given table. Compressors reduce the size of large, opaque byte arrays in + cells, and can significantly reduce the storage space needed to store uncompressed + data. + Compressors and data block encoding can be used together on the same ColumnFamily. + + + Changes Take Effect Upon Compaction + If you change compression or encoding for a ColumnFamily, the changes take effect during + compaction. + + + To configure HBase to use a compressor, see . To enable a compressor for a ColumnFamily, see . To enable data block encoding for a ColumnFamily, see + . + + Block Compressors + + none + + + Snappy + + + LZO + + + LZ4 + + + GZ + + -
- CompressionTest Tool - - HBase includes a tool to test compression is set up properly. - To run it, type /bin/hbase org.apache.hadoop.hbase.util.CompressionTest. - This will emit usage on how to run the tool. - - You need to restart regionserver for it to pick up changes! - Be aware that the regionserver caches the result of the compression check it runs - ahead of each region open. This means that you will have to restart the regionserver - for it to notice that you have fixed any codec issues; e.g. changed symlinks or - moved lib locations under HBase. - - On the location of native libraries - Hadoop looks in lib/native for .so files. HBase looks in - lib/native/PLATFORM. See the bin/hbase. - View the file and look for native. See how we - do the work to find out what platform we are running on running a little java program - org.apache.hadoop.util.PlatformName to figure it out. - We'll then add ./lib/native/PLATFORM to the - LD_LIBRARY_PATH environment for when the JVM starts. - The JVM will look in here (as well as in any other dirs specified on LD_LIBRARY_PATH) - for codec native libs. If you are unable to figure your 'platform', do: - $ ./bin/hbase org.apache.hadoop.util.PlatformName. - An example platform would be Linux-amd64-64. - - -
-
- - <varname> - hbase.regionserver.codecs - </varname> - - - To have a RegionServer test a set of codecs and fail-to-start if any - code is missing or misinstalled, add the configuration - - hbase.regionserver.codecs - - to your hbase-site.xml with a value of - codecs to test on startup. For example if the - - hbase.regionserver.codecs - value is lzo,gz and if lzo is not present - or improperly installed, the misconfigured RegionServer will fail - to start. - - - Administrators might make use of this facility to guard against - the case where a new server is added to cluster but the cluster - requires install of a particular coded. - -
+ + Data Block Encoding Types + + Prefix - Often, keys are very similar. Specifically, keys often share a common prefix + and only differ near the end. For instance, one key might be + RowKey:Family:Qualifier0 and the next key might be + RowKey:Family:Qualifier1. In Prefix encoding, an extra column is + added which holds the length of the prefix shared between the current key and the previous + key. Assuming the first key here is totally different from the key before, its prefix + length is 0. The second key's prefix length is 23, since they have the + first 23 characters in common. + Obviously if the keys tend to have nothing in common, Prefix will not provide much + benefit. + The following image shows a hypothetical ColumnFamily with no data block encoding. +
+ ColumnFamily with No Encoding + + + + + + + +
+ Here is the same data with prefix data encoding. +
+ ColumnFamily with Prefix Encoding + + + + + + + +
+
+ + Diff - Diff encoding expands upon Prefix encoding. Instead of considering the key + sequentially as a monolithic series of bytes, each key field is split so that each part of + the key can be compressed more efficiently. Two new fields are added: timestamp and type. + If the ColumnFamily is the same as the previous row, it is omitted from the current row. + If the key length, value length, or type is the same as the previous row, the field is + omitted. In addition, for increased compression, the timestamp is stored as a Diff from + the previous row's timestamp, rather than being stored in full. Given the two row keys in + the Prefix example, and given an exact match on timestamp and the same type, neither the + value length nor the type needs to be stored for the second row, and the timestamp value for + the second row is just 0, rather than a full timestamp. + Diff encoding is disabled by default because writing and scanning are slower but more + data is cached. + This image shows the same ColumnFamily from the previous images, with Diff encoding. +
+ ColumnFamily with Diff Encoding + + + + + + + +
+
+ + Fast Diff - Fast Diff works similarly to Diff, but adds another field which stores a + single bit to track whether the data itself is the same as the previous row. If it is, the + data is not stored again. Fast Diff is the recommended codec to use if you have long keys + or many columns. + + + + Prefix Tree - Prefix Tree encoding was introduced as an experimental feature in HBase 0.96. It + provides similar memory savings to the Prefix, Diff, and Fast Diff encoders, but allows + faster random access at the cost of slower encoding speed. Prefix Tree may be appropriate + for applications that have high block cache hit ratios. It introduces new 'tree' fields + for the row and column. The row tree field contains a list of offsets/references + corresponding to the cells in that row. This allows for a good deal of compression. For + more details about Prefix Tree encoding, see HBASE-4676. + +
-
- - GZIP - - - GZIP will generally compress better than LZO but it will run slower. - For some setups, better compression may be preferred ('cold' data). - Java will use java's GZIP unless the native Hadoop libs are - available on the CLASSPATH; in this case it will use native - compressors instead (If the native libs are NOT present, - you will see lots of Got brand-new compressor - reports in your logs; see ). - +
+ Which Compression or Codec To Use + The compression or codec type to use depends on the characteristics of your data. + Choosing the wrong type could cause your data to take more space rather than less, and can + have performance implications. In general, you need to weigh your options between smaller + size and faster compression/decompression. Following are some general guidelines; the + example after this list shows one way to apply them. + + + If you have long keys (compared to the values) or many columns, use a prefix + encoder. FAST_DIFF is recommended, as more testing is needed for Prefix Tree + encoding. + + + If the values are large (and not precompressed, such as images), use a data block + compressor. + + + Use GZIP for cold data, which is accessed infrequently. GZIP + compression uses more CPU resources than Snappy or LZO, but provides a higher + compression ratio. + + + Use Snappy or LZO for hot data, which is accessed + frequently. Snappy and LZO use fewer CPU resources than GZIP, but do not provide as high + a compression ratio. + + + In most cases, enabling Snappy or LZO by default is a good choice, because they have + a low performance overhead and provide space savings. + + + Before Google released Snappy in 2011, LZO was the default. Snappy has + similar qualities to LZO but has been shown to perform better. + +
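+ As a purely illustrative sketch of how these guidelines can be applied, the HBase Shell command below creates a table whose single ColumnFamily combines FAST_DIFF data block encoding with SNAPPY compression. The table name 'user_events' and family name 'd' are hypothetical, and Snappy must already be installed and verified as described later in this section.
hbase> create 'user_events', { NAME => 'd', DATA_BLOCK_ENCODING => 'FAST_DIFF', COMPRESSION => 'SNAPPY' }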
-
- - LZ4 - - - LZ4 is bundled with Hadoop. Make sure the hadoop .so is - accessible when you start HBase. One means of doing this is after figuring your - platform, see , make a symlink from HBase - to the native Hadoop libraries presuming the two software installs are colocated. - For example, if my 'platform' is Linux-amd64-64: - $ cd $HBASE_HOME -$ mkdir lib/native -$ ln -s $HADOOP_HOME/lib/native lib/native/Linux-amd64-64 - Use the compression tool to check lz4 installed on all nodes. - Start up (or restart) hbase. From here on out you will be able to create - and alter tables to enable LZ4 as a compression codec. E.g.: - hbase(main):003:0> alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'} - -
+
+ Compressor Configuration, Installation, and Use +
+ Configure HBase For Compressors + Before HBase can use a given compressor, its libraries need to be available. Due to + licensing issues, a default installation provides only GZ compression to HBase, via Java's + built-in GZip support (or via the native Hadoop libraries, if they are present). +
+ Install GZ Support Via Native Libraries + HBase uses Java's built-in GZip support unless the native Hadoop libraries are + available on the CLASSPATH. The recommended way to add libraries to the CLASSPATH is to + set the environment variable HBASE_LIBRARY_PATH for the user running + HBase. If native libraries are not available and Java's GZIP is used, Got + brand-new compressor reports will be present in the logs. See ). +
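+ For example, a minimal hbase-env.sh sketch might look like the following. The path shown is illustrative only; point it at the directory that actually holds the native Hadoop libraries built for your platform, and restart HBase so the setting takes effect.
# In conf/hbase-env.sh for the user running HBase (illustrative path --
# substitute the directory containing your platform's native Hadoop libraries)
export HBASE_LIBRARY_PATH=/usr/local/hadoop/lib/native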
-
- - LZO - - Unfortunately, HBase cannot ship with LZO because of - the licensing issues; HBase is Apache-licensed, LZO is GPL. - Therefore LZO install is to be done post-HBase install. - See the Using LZO Compression - wiki page for how to make LZO work with HBase. - - A common problem users run into when using LZO is that while initial - setup of the cluster runs smooth, a month goes by and some sysadmin goes to - add a machine to the cluster only they'll have forgotten to do the LZO - fixup on the new machine. In versions since HBase 0.90.0, we should - fail in a way that makes it plain what the problem is, but maybe not. - See - for a feature to help protect against failed LZO install. -
-
- - SNAPPY - - - If snappy is installed, HBase can make use of it (courtesy of - hadoop-snappy - See Alejandro's note up on the list on difference between Snappy in Hadoop - and Snappy in HBase). +
+ Install LZO Support + HBase cannot ship with LZO because of licensing incompatibility between HBase, which uses + the Apache Software License (ASL), and LZO, which uses the GPL. See the Using LZO + Compression wiki page for information on configuring LZO support for HBase. + If you depend upon LZO compression, consider configuring your RegionServers to fail + to start if LZO is not available. See . +
- - - - Build and install snappy on all nodes - of your cluster (see below). HBase nor Hadoop cannot include snappy because of licensing issues (The - hadoop libhadoop.so under its native dir does not include snappy; of note, the shipped .so - may be for 32-bit architectures -- this fact has tripped up folks in the past with them thinking - it 64-bit). The notes below are about installing snappy for HBase use. You may want snappy - available in your hadoop context also. That is not covered here. - HBase and Hadoop find the snappy .so in different locations currently: Hadoop picks those files in - ./lib while HBase finds the .so in ./lib/[PLATFORM]. - - - - - Use CompressionTest to verify snappy support is enabled and the libs can be loaded ON ALL NODES of your cluster: - $ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy - - - - - Create a column family with snappy compression and verify it in the hbase shell: - $ hbase> create 't1', { NAME => 'cf1', COMPRESSION => 'SNAPPY' } -hbase> describe 't1' - In the output of the "describe" command, you need to ensure it lists "COMPRESSION => 'SNAPPY'" - - +
+ Install Snappy Support + HBase does not ship with Snappy support because of licensing issues. You can install + Snappy binaries (for instance, by using yum install snappy on CentOS) + or build Snappy from source. After installing Snappy, search for the shared library, + which will be called libsnappy.so.X where X is a number. If you + built from source, copy the shared library to a known location on your system, such as + /opt/snappy/lib/. + In addition to the Snappy library, HBase also needs access to the Hadoop shared + library, which will be called something like libhadoop.so.X.Y, + where X and Y are both numbers. Make note of the location of the Hadoop library, or copy + it to the same location as the Snappy library. + + The Snappy and Hadoop libraries need to be available on each node of your cluster. + See to find out how to test that this is the case. + See to configure your RegionServers to fail to + start if a given compressor is not available. + + Each of these library locations needs to be added to the environment variable + HBASE_LIBRARY_PATH for the operating system user that runs HBase. You + need to restart the RegionServer for the changes to take effect. +
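+ A minimal sketch of the steps above, assuming both shared libraries are copied to the illustrative location /opt/snappy/lib/. All file names, versions, and paths below are examples; substitute what you actually find on your nodes.
# Illustrative file names and paths -- adjust to your own system
$ sudo mkdir -p /opt/snappy/lib
$ sudo cp snappy-1.1.0/.libs/libsnappy.so.1 /opt/snappy/lib/            # Snappy built from source
$ sudo cp $HADOOP_HOME/lib/native/libhadoop.so.1.0.0 /opt/snappy/lib/   # Hadoop shared library
# Then, in conf/hbase-env.sh for the user running HBase:
export HBASE_LIBRARY_PATH=/opt/snappy/lib
# Restart the RegionServers so the change takes effect.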
-
-
-
- - Installation - - Snappy is used by hbase to compress HFiles on flush and when compacting. - - - You will find the snappy library file under the .libs directory from your Snappy build (For example - /home/hbase/snappy-1.0.5/.libs/). The file is called libsnappy.so.1.x.x where 1.x.x is the version of the snappy - code you are building. You can either copy this file into your hbase lib directory -- under lib/native/PLATFORM -- - naming the file as libsnappy.so, - or simply create a symbolic link to it (See ./bin/hbase for how it does library path for native libs). - +
+ CompressionTest + You can use the CompressionTest tool to verify that your compressor is available to + HBase: + + $ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy + +
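+ Because the compressor libraries must be present on every node, it can help to run the test on each RegionServer host rather than only once. The loop below is only a sketch: it assumes passwordless SSH to the hosts listed in conf/regionservers and that the hbase script is on each host's PATH; the file:// path used for the test output is arbitrary, and you can point at an HDFS path as in the example above instead.
$ for host in $(cat conf/regionservers); do echo "== $host =="; ssh "$host" "hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/compressiontest.out snappy"; done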
- - The second file you need is the hadoop native library. You will find this file in your hadoop installation directory - under lib/native/Linux-amd64-64/ or lib/native/Linux-i386-32/. The file you are looking for is libhadoop.so.1.x.x. - Again, you can simply copy this file or link to it from under hbase in lib/native/PLATFORM (e.g. Linux-amd64-64, etc.), - using the name libhadoop.so. - +
+ Enforce Compression Settings On a RegionServer + You can configure a RegionServer so that it will fail to start if compression is + configured incorrectly, by adding the option hbase.regionserver.codecs to the + hbase-site.xml, and setting its value to a comma-separated list + of codecs that need to be available. For example, if you set this property to + lzo,gz, the RegionServer would fail to start if either compressor + were not available. This prevents a new server from being added to the cluster + without having its codecs configured properly. +
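+ A sketch of the corresponding hbase-site.xml entry follows. The codecs listed here (snappy and gz) are only an example; substitute the comma-separated list your cluster actually requires, and restart the RegionServers after changing the setting.
<property>
  <name>hbase.regionserver.codecs</name>
  <value>snappy,gz</value>
</property>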
+
- - At the end of the installation, you should have both libsnappy.so and libhadoop.so links or files present into - lib/native/Linux-amd64-64 or into lib/native/Linux-i386-32 (where the last part of the directory path is the - PLATFORM you built and rare running the native lib on) - - To point hbase at snappy support, in hbase-env.sh set - export HBASE_LIBRARY_PATH=/pathtoyourhadoop/lib/native/Linux-amd64-64 - In /pathtoyourhadoop/lib/native/Linux-amd64-64 you should have something like: - - libsnappy.a - libsnappy.so - libsnappy.so.1 - libsnappy.so.1.1.2 - - -
+
+ Enable Compression On a ColumnFamily + To enable compression for a ColumnFamily, use an alter command. You do + not need to re-create the table or copy data. If you are changing codecs, be sure the old + codec is still available until all the old StoreFiles have been compacted. + + Enabling Compression on a ColumnFamily of an Existing Table using HBase + Shell + hbase> disable 'test'
hbase> alter 'test', {NAME => 'cf', COMPRESSION => 'GZ'}
hbase> enable 'test']]> + + + + Creating a New Table with Compression On a ColumnFamily + hbase> create 'test2', { NAME => 'cf2', COMPRESSION => 'SNAPPY' }
 ]]> + + + Verifying a ColumnFamily's Compression Settings + hbase> describe 'test'
DESCRIPTION                                          ENABLED
 'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE false
 ', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
 VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERSIONS
 => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'fa
 lse', BLOCKSIZE => '65536', IN_MEMORY => 'false', B
 LOCKCACHE => 'true'}
1 row(s) in 0.1070 seconds
 ]]> +
+ +
+ Testing Compression Performance + HBase includes a tool called LoadTestTool which provides mechanisms to test your + compression performance. You must specify either -write or + -update-read as your first parameter, and if you do not specify another + parameter, usage advice is printed for each option. + + <command>LoadTestTool</command> Usage + +Options: + -batchupdate Whether to use batch as opposed to separate + updates for every column in a row + -bloom Bloom filter type, one of [NONE, ROW, ROWCOL] + -compression Compression type, one of [LZO, GZ, NONE, SNAPPY, + LZ4] + -data_block_encoding Encoding algorithm (e.g. prefix compression) to + use for data blocks in the test column family, one + of [NONE, PREFIX, DIFF, FAST_DIFF, PREFIX_TREE]. + -encryption Enables transparent encryption on the test table, + one of [AES] + -generator The class which generates load for the tool. Any + args for this class can be passed as colon + separated after class name + -h,--help Show usage + -in_memory Tries to keep the HFiles of the CF inmemory as far + as possible. Not guaranteed that reads are always + served from inmemory + -init_only Initialize the test table only, don't do any + loading + -key_window The 'key window' to maintain between reads and + writes for concurrent write/read workload. The + default is 0. + -max_read_errors The maximum number of read errors to tolerate + before terminating all reader threads. The default + is 10. + -multiput Whether to use multi-puts as opposed to separate + puts for every column in a row + -num_keys The number of keys to read/write + -num_tables A positive integer number. When a number n is + speicfied, load test tool will load n table + parallely. -tn parameter value becomes table name + prefix. Each table name is in format + _1..._n + -read [:<#threads=20>] + -regions_per_server A positive integer number. When a number n is + specified, load test tool will create the test + table with n regions per server + -skip_init Skip the initialization; assume test table already + exists + -start_key The first key to read/write (a 0-based index). The + default value is 0. + -tn The name of the table to read or write + -update [:<#threads=20>][:<#whether to + ignore nonce collisions=0>] + -write :[:<#threads=20>] + -zk ZK quorum as comma-separated host names without + port numbers + -zk_root name of parent znode in zookeeper + ]]> + + + Example Usage of LoadTestTool + +$ hbase org.apache.hadoop.hbase.util.LoadTestTool -write 1:10:100 -num_keys 1000000 + -read 100:30 -num_tables 1 -data_block_encoding NONE -tn load_test_tool_NONE + + +
-
- Changing Compression Schemes - A frequent question on the dist-list is how to change compression schemes for ColumnFamilies. This is actually quite simple, - and can be done via an alter command. Because the compression scheme is encoded at the block-level in StoreFiles, the table does - not need to be re-created and the data does not copied somewhere else. Just make sure - the old codec is still available until you are sure that all of the old StoreFiles have been compacted. - + +
+ Enable Data Block Encoding + Data block encoders are built into HBase, so no extra configuration is needed. An encoder is enabled on a + table by setting the DATA_BLOCK_ENCODING property. Disable the table before + altering its DATA_BLOCK_ENCODING setting. Following is an example using HBase Shell: + + Enable Data Block Encoding On a Table + hbase> disable 'test'
hbase> alter 'test', { NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 2.2820 seconds
hbase> enable 'test'
0 row(s) in 0.1580 seconds
 ]]> + + + Verifying a ColumnFamily's Data Block Encoding + hbase> describe 'test'
DESCRIPTION                                          ENABLED
 'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST true
 _DIFF', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =>
 '0', VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERS
 IONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS =
 > 'false', BLOCKSIZE => '65536', IN_MEMORY => 'fals
 e', BLOCKCACHE => 'true'}
1 row(s) in 0.0650 seconds
 ]]>
+ <link xlink:href="https://github.com/brianfrankcooper/YCSB/">YCSB: The Yahoo! Cloud Serving Benchmark</link> and HBase TODO: Describe how YCSB is poor for putting up a decent cluster load. diff --git src/main/docbkx/images/data_block_diff_encoding.png src/main/docbkx/images/data_block_diff_encoding.png new file mode 100644 index 0000000..270b9c0 Binary files /dev/null and src/main/docbkx/images/data_block_diff_encoding.png differ diff --git src/main/docbkx/images/data_block_no_encoding.png src/main/docbkx/images/data_block_no_encoding.png new file mode 100644 index 0000000..065d8d8 Binary files /dev/null and src/main/docbkx/images/data_block_no_encoding.png differ diff --git src/main/docbkx/images/data_block_prefix_encoding.png src/main/docbkx/images/data_block_prefix_encoding.png new file mode 100644 index 0000000..0afd193 Binary files /dev/null and src/main/docbkx/images/data_block_prefix_encoding.png differ