Copyright © 2011 Apache Software Foundation
| Revision History | | |
|---|---|---|
| Revision 0.91.0-SNAPSHOT | | |
| Adding first cuts at Configuration, Getting Started, Data Model | | |
| Revision 0.89.20100924 | 5 October 2010 | stack |
| Initial layout | | |
Abstract
This is the official book of Apache HBase, a distributed, versioned, column-oriented database built on top of Apache Hadoop and Apache ZooKeeper.
See the HBase and MapReduce javadoc. Start there. Below is some additional help.
When TableInputFormat is used to source an HBase table in a MapReduce job, its splitter will make a map task for each region of the table. Thus, if there are 100 regions in the table, there will be 100 map tasks for the job, regardless of how many column families are selected in the Scan.
To use HBase as a MapReduce source, the job would be configured via TableMapReduceUtil in the following manner...
Job job = ...;
Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);
// Now set other scan attrs
...
TableMapReduceUtil.initTableMapperJob(
  tableName,          // input HBase table name
  scan,               // Scan instance to control CF and attribute selection
  MyMapper.class,     // mapper
  Text.class,         // mapper output key
  LongWritable.class, // mapper output value
  job);               // job instance
...and the mapper instance would extend TableMapper...
public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws InterruptedException, IOException {
    // process data for the row from the Result instance.
  }
}
Although the framework currently allows one HBase table as input to a MapReduce job, other HBase tables can be accessed as lookup tables, etc., in a MapReduce job by creating an HTable instance in the setup method of the Mapper.
public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {

  private HTable myOtherTable;

  @Override
  public void setup(Context context) throws IOException {
    myOtherTable = new HTable("myOtherTable");
  }
}
It is generally advisable to turn off speculative execution for MapReduce jobs that use HBase as a source. This can either be done on a per-job basis through properties, or on the entire cluster. Especially for longer running jobs, speculative execution will create duplicate map tasks which will double-write your data to HBase; this is probably not what you want.
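A minimal sketch of the per-job approach, assuming the classic Hadoop 0.20-era property names (newer Hadoop releases use mapreduce.map.speculative and mapreduce.reduce.speculative instead):
Configuration conf = HBaseConfiguration.create();
// Disable speculative execution for this job only (0.20-era property names).
conf.setBoolean("mapred.map.tasks.speculative.execution", false);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
Job job = new Job(conf, "hbase-sourced-job");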
A good general introduction to the strengths and weaknesses of data modeling in the various non-RDBMS datastores is Ian Varley's master's thesis, No Relation: The Mixed Blessings of Non-Relational Databases. Recommended.
HBase schemas can be created or updated with ??? or by using HBaseAdmin in the Java API.
Tables must be disabled when making ColumnFamily modifications, for example:
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
String table = "myTable";

admin.disableTable(table);

HColumnDescriptor cf1 = ...;
admin.addColumn(table, cf1);    // adding new ColumnFamily
HColumnDescriptor cf2 = ...;
admin.modifyColumn(table, cf2); // modifying existing ColumnFamily

admin.enableTable(table);
See ??? for more information about configuring client connections.
HBase currently does not do well with anything above two or three column families, so keep the number of column families in your schema low. Currently, flushing and compaction are done on a per-region basis, so if one column family is carrying the bulk of the data and triggering flushes, the adjacent families will also be flushed even though the amount of data they carry is small. Compaction is currently triggered by the total number of files under a column family, not by their size. When there are many column families, the flushing and compaction interaction can make for a lot of needless I/O (to be addressed by changing flushing and compaction to work on a per-column-family basis).
Try to make do with one column family if you can in your schemas. Only introduce a second and third column family in the case where data access is usually column scoped; i.e., you query one column family or the other, but usually not both at the same time.
In the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly) there is an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table's regions (and thus, a single node), then moving onto the next region, etc. With monotonically increasing row-keys (i.e., using a timestamp), this will happen. See this comic by IKai Lan on why monotonically increasing row keys are problematic in BigTable-like datastores: monotonically increasing values are bad. The pile-up on a single region brought on by monotonically increasing keys can be mitigated by randomizing the input records to not be in sorted order, but in general it's best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key.
If you do need to upload time series data into HBase, you should study OpenTSDB as a successful example. It has a page describing the schema it uses in HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the lead position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table.
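As a rough illustration of such a composite key (the width and encoding of the metric id here are made up for the example and are not OpenTSDB's actual layout):
byte[] metricId = Bytes.toBytes((short) 42);                 // e.g., a compact metric-type id in the lead position
byte[] ts       = Bytes.toBytes(System.currentTimeMillis()); // event timestamp
byte[] rowKey   = Bytes.add(metricId, ts);                   // [metric_type][event_timestamp]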
In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it'll be accompanied by its row, column name, and timestamp - always. If your rows and column names are large, especially compared to the size of the cell value, then you may run up against some interesting scenarios. One such is the case described by Marc Limotte at the tail of HBASE-3551 (recommended!). Therein, the indices that are kept on HBase storefiles (the section called “StoreFile (HFile)”) to facilitate random access may end up occupying large chunks of the HBase allotted RAM because the cell value coordinates are large. Marc in the above cited comment suggests upping the block size so entries in the store file index happen at a larger interval, or modifying the table schema so it makes for smaller rows and column names. Compression will also make for larger indices. See the thread a question storefileIndexSize up on the user mailing list.
In summary, although verbose attribute names (e.g., "myImportantAttribute") are easier to read, you pay for that clarity in storage and increased I/O; prefer shorter attribute names (defined as constants in your code). Also try to keep the row-keys as small as possible.
The number of row versions to store is configured per column family via HColumnDescriptor. The default is 3. This is an important parameter because, as described in the Chapter 5, Data Model section, HBase does not overwrite row values, but rather stores different values per row by time (and qualifier). Excess versions are removed during major compactions. The number of versions may need to be increased or decreased depending on application needs.
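For example, the version count on an existing column family can be changed through HBaseAdmin, following the same disable/modify/enable pattern shown earlier in this chapter (table and family names here are illustrative):
HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
admin.disableTable("myTable");

HColumnDescriptor cf = new HColumnDescriptor("cf");
cf.setMaxVersions(5);              // keep up to 5 versions instead of the default 3
// note: a fresh descriptor replaces the existing one, so set any other
// non-default attributes the family needs as well
admin.modifyColumn("myTable", cf); // modifying existing ColumnFamily

admin.enableTable("myTable");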
Rowkeys cannot be changed. The only way they can be "changed" in a table is if the row is deleted and then re-inserted. This is a fairly common question on the HBase dist-list so it pays to get the rowkeys right the first time (and/or before you've inserted a lot of data).
HBase supports a "bytes-in/bytes-out" interface via Put and Result, so anything that can be converted to an array of bytes can be stored as a value. Input could be strings, numbers, complex objects, or even images, as long as they can be rendered as bytes.
There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Chapter 5, Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
One supported datatype that deserves special mention is "counters" (i.e., the ability to do atomic increments of numbers). See Increment in HTable.
Synchronization on counters is done on the RegionServer, not in the client.
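As an illustration (the table, family, and qualifier names here are made up), a single atomic increment via HTable looks like this:
HTable table = new HTable(HBaseConfiguration.create(), "counters");
long hits = table.incrementColumnValue(
    Bytes.toBytes("page#home"), // row
    Bytes.toBytes("daily"),     // column family
    Bytes.toBytes("hits"),      // qualifier
    1L);                        // amount, added atomically on the RegionServer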
ColumnFamilies can optionally be defined as in-memory. Data is still persisted to disk, just like any other ColumnFamily. In-memory blocks have the highest priority in the section called “Block Cache”, but it is not a guarantee that the entire table will be in memory.
See HColumnDescriptor for more information.
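For example, a table could be created with one in-memory family as follows (table and family names are illustrative):
HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
HTableDescriptor desc = new HTableDescriptor("myTable");
HColumnDescriptor hot = new HColumnDescriptor("hot");
hot.setInMemory(true); // blocks from this family get the in-memory priority in the Block Cache
desc.addFamily(hot);
admin.createTable(desc);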
This section could also be titled "what if my table rowkey looks like this but I also want to query my table like that." A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain time ranges. Thus, selecting by user is easy because it is in the lead position of the key, but time is not.
There is no single answer on the best way to handle this because it depends on...
... and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution. Common techniques are in sub-sections below. This is a comprehensive, but not exhaustive, list of approaches.
It should not be a surprise that secondary indexes require additional cluster space and processing. This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update. RDBMS products are more advanced in this regard and handle alternative index management out of the box. However, HBase scales better at larger data volumes, so this is a feature trade-off.
Pay attention to ??? when implementing any of these approaches.
Additionally, see the David Butler response in this dist-list thread HBase, mail # user - Stargate+hbase
Depending on the case, it may be appropriate to use the section called “Filters”. In this case, no secondary index is created. However, don't try a full-scan on a large table like this from an application (i.e., single-threaded client).
A secondary index could be created in another table which is periodically updated via a MapReduce job. The job could be executed intra-day but, depending on load strategy, it could still potentially be out of sync with the main data table.
See Chapter 1, HBase and MapReduce for more information.
Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to the data table and write to the index table). If this approach is taken after a data table already exists, then bootstrapping the secondary index will require a MapReduce job (see the section called “ Periodic-Update Secondary Index ”).
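A rough sketch of the dual-write idea follows; the table, family, and qualifier names are made up, and note that the two writes are not atomic across tables:
Configuration conf = HBaseConfiguration.create();
HTable dataTable  = new HTable(conf, "user");
HTable indexTable = new HTable(conf, "user_by_email");

byte[] userId = Bytes.toBytes("user123");
byte[] email  = Bytes.toBytes("jane@example.com");

Put dataPut = new Put(userId);
dataPut.add(Bytes.toBytes("info"), Bytes.toBytes("email"), email);
dataTable.put(dataPut);

Put indexPut = new Put(email); // the indexed attribute is in the lead position of the index row key
indexPut.add(Bytes.toBytes("ref"), Bytes.toBytes("user"), userId);
indexTable.put(indexPut);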
Where time-ranges are very wide (e.g., year-long report) and where the data is voluminous, summary tables are a common approach. These would be generated with MapReduce jobs into another table.
See Chapter 1, HBase and MapReduce for more information.
See Metrics for an introduction and how to enable Metrics emission.

hbase.regionserver.blockCacheCount
Block cache item count in memory. This is the number of blocks of storefiles (HFiles) in the cache.

hbase.regionserver.blockCacheHitRatio
Block cache hit ratio (0 to 100). TODO: describe the impact on this ratio of read requests that have cacheBlocks=false.

hbase.regionserver.blockCacheSize
Block cache size in memory (bytes), i.e., memory in use by the BlockCache.

hbase.regionserver.compactionQueueSize
Size of the compaction queue. This is the number of stores in the region that have been targeted for compaction.

hbase.regionserver.fsReadLatency_avg_time
Filesystem read latency (ms). This is the average time to read from HDFS.

hbase.regionserver.requests
Total number of read and write requests. Requests correspond to RegionServer RPC calls; thus a single Get will result in 1 request, but a Scan with caching set to 1000 will result in 1 request for each 'next' call (i.e., not each row). A bulk-load request will constitute 1 request per HFile.

hbase.regionserver.storeFileIndexSizeMB
Sum of all the storefile index sizes in this RegionServer (MB).

hbase.regionserver.stores
Number of stores open on the RegionServer. A store corresponds to a column family. For example, if a table (which contains the column family) has 3 regions on a RegionServer, there will be 3 stores open for that column family.
See Cluster Replication.
In short, applications store data into an HBase table. Tables are made of rows and columns. All columns in HBase belong to a particular column family. Table cells -- the intersection of row and column coordinates -- are versioned. A cell’s content is an uninterpreted array of bytes.
Table row keys are also byte arrays so almost anything can serve as a row key from strings to binary representations of longs or even serialized data structures. Rows in HBase tables are sorted by row key. The sort is byte-ordered. All table accesses are via the table row key -- its primary key.
The following example is a slightly modified form of the one on page 2 of the BigTable paper. There is a table called webtable that contains two column families named contents and anchor. In this example, anchor contains two columns (anchor:cnnsi.com, anchor:my.look.ca) and contents contains one column (contents:html). By convention, a column name is made of its column family prefix and a qualifier. For example, the column contents:html is of the column family contents. The colon character (:) delimits the column family from the column family qualifier.
Table 5.1. Table webtable
| Row Key | Time Stamp | ColumnFamily contents | ColumnFamily anchor |
|---|---|---|---|
| "com.cnn.www" | t9 | anchor:cnnsi.com = "CNN" | |
| "com.cnn.www" | t8 | anchor:my.look.ca = "CNN.com" | |
| "com.cnn.www" | t6 | contents:html = "<html>..." | |
| "com.cnn.www" | t5 | contents:html = "<html>..." | |
| "com.cnn.www" | t3 | contents:html = "<html>..." |
Although at a conceptual level tables may be viewed as a sparse set of rows, physically they are stored on a per-column-family basis. New columns (i.e., columnfamily:column) can be added to any column family without pre-announcing them.
Table 5.2. ColumnFamily anchor
| Row Key | Time Stamp | Column Family anchor |
|---|---|---|
| "com.cnn.www" | t9 | anchor:cnnsi.com = "CNN" |
| "com.cnn.www" | t8 | anchor:my.look.ca = "CNN.com" |
Table 5.3. ColumnFamily contents
| Row Key | Time Stamp | ColumnFamily "contents:" |
|---|---|---|
| "com.cnn.www" | t6 | contents:html = "<html>..." |
| "com.cnn.www" | t5 | contents:html = "<html>..." |
| "com.cnn.www" | t3 | contents:html = "<html>..." |
It is important to note in the diagram above that the empty cells shown in the conceptual view are not stored, since they need not be in a column-oriented storage format. Thus a request for the value of the contents:html column at time stamp t8 would return no value. Similarly, a request for an anchor:my.look.ca value at time stamp t9 would return no value. However, if no timestamp is supplied, the most recent value for a particular column would be returned and would also be the first one found since timestamps are stored in descending order. Thus a request for the values of all columns in the row com.cnn.www if no timestamp is specified would be: the value of contents:html from time stamp t6, the value of anchor:cnnsi.com from time stamp t9, the value of anchor:my.look.ca from time stamp t8.
Row keys are uninterpreted bytes. Rows are lexicographically sorted, with the lowest order appearing first in a table. The empty byte array is used to denote both the start and end of a table's namespace.
Columns in HBase are grouped into column families. All column members of a column family have the same prefix. For example, the columns courses:history and courses:math are both members of the courses column family. The colon character (:) delimits the column family from the column family qualifier. The column family prefix must be composed of printable characters. The qualifying tail, the column family qualifier, can be made of any arbitrary bytes. Column families must be declared up front at schema definition time, whereas columns do not need to be defined at schema time but can be conjured on the fly while the table is up and running.
Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.
A {row, column, version} tuple exactly specifies a cell in HBase. Cell content is uninterpreted bytes. It's possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.

While rows and column keys are expressed as bytes, the version is specified using a long integer. Typically this long contains time instances such as those returned by java.util.Date.getTime() or System.currentTimeMillis(), that is: “the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC”.
The HBase version dimension is stored in decreasing order, so that when reading from a store file, the most recent values are found first.
There is a lot of confusion over the semantics of cell versions in HBase. In particular, a couple of questions that often come up are:
Below we describe how the version dimension in HBase currently works[3].
In this section we look at the behavior of the version dimension for each of the core HBase operations.
Gets are implemented on top of Scans. The below discussion of Get applies equally to Scans.
By default, i.e. if you specify no explicit version, when doing a get, the cell whose version has the largest value is returned (which may or may not be the latest one written, see later). The default behavior can be modified in the following ways:
to return more than one version, see Get.setMaxVersions()
to return versions other than the latest, see Get.setTimeRange()
To retrieve the latest version that is less than or equal to a given value, thus giving the 'latest' state of the record at a certain point in time, just use a range from 0 to the desired version and set the max versions to 1.
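For example, to read the latest state of a row as of some point in time T (the column family and qualifier names here mirror the examples below and are made up):
long T = ...; // the point in time of interest, in milliseconds
Get get = new Get(Bytes.toBytes("row1"));
get.setTimeRange(0, T + 1); // the upper bound is exclusive, so add 1 to include T itself
get.setMaxVersions(1);      // only the newest version inside the range
Result r = htable.get(get);
byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));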
The following Get will only retrieve the current version of the row
Get get = new Get(Bytes.toBytes("row1"));
Result r = htable.get(get);
byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns current version of value
The following Get will return the last 3 versions of the row.
Get get = new Get(Bytes.toBytes("row1"));
get.setMaxVersions(3); // will return last 3 versions of row
Result r = htable.get(get);
byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns current version of value
List<KeyValue> kv = r.getColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns all versions of this column
Doing a put always creates a new version of a cell, at a certain timestamp. By default the system uses the server's currentTimeMillis, but you can specify the version (= the long integer) yourself, on a per-column level. This means you could assign a time in the past or the future, or use the long value for non-time purposes.
To overwrite an existing value, do a put at exactly the same row, column, and version as that of the cell you would overshadow.
The following Put will be implicitly versioned by HBase with the current time.
Put put = new Put(Bytes.toBytes(row));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes( data));
htable.put(put);
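To supply the version explicitly (and thus, as described above, potentially overwrite an existing cell at exactly that row, column, and version), use the Put.add overload that takes a timestamp; the explicitVersion value here is illustrative:
long explicitVersion = ...; // the version (timestamp) you want this cell written at
Put put = new Put(Bytes.toBytes(row));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), explicitVersion, Bytes.toBytes(data));
htable.put(put);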
When performing a delete operation in HBase, there are two ways to specify the versions to be deleted:
Delete all versions older than a certain timestamp
Delete the version at a specific timestamp
A delete can apply to a complete row, a complete column family, or to just one column. It is only in the last case that you can delete explicit versions. For the deletion of a row or all the columns within a family, it always works by deleting all cells older than a certain version.
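The corresponding Delete calls look roughly like the following; in practice you would use only the scope you need, and the row, family, qualifier, and timestamp here are made up:
long someTimestamp = ...;
Delete d = new Delete(Bytes.toBytes("row1"));
d.deleteColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr"), someTimestamp); // one explicit version
d.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("attr"));               // all versions of this column
d.deleteFamily(Bytes.toBytes("cf"));                                       // the whole column family
htable.delete(d);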
Deletes work by creating tombstone markers. For example, let's suppose we want to delete a row. For this you can specify a version, or else by default the currentTimeMillis is used. What this means is “delete all cells where the version is less than or equal to this version”. HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition. Rather, a so-called tombstone is written, which will mask the deleted values[4]. If the version you specified when deleting a row is larger than the version of any value in the row, then you can consider the complete row to be deleted.
There are still some bugs (or at least 'undecided behavior') with the version dimension that will be addressed by later HBase releases.
Deletes mask puts, even puts that happened after the delete was entered[5]. Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you do a delete of everything <= T. After this you do a new put with a timestamp <= T. This put, even if it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put had no effect. It will start working again after the major compaction has run. These issues should not be a problem if you use always-increasing versions for new puts to a row. But they can occur even if you do not care about time: just do a delete and put immediately after each other, and there is some chance they happen within the same millisecond.
“...create three cell versions at t1, t2 and t3, with a maximum-versions setting of 2. So when getting all versions, only the values at t2 and t3 will be returned. But if you delete the version at t2 or t3, the one at t1 will appear again. Obviously, once a major compaction has run, such behavior will not be the case anymore...[6]”
[1] Currently, only the last written is fetchable.
[2] Yes
[3] See HBASE-2406 for discussion of HBase versions. Bending time in HBase makes for a good read on the version, or time, dimension in HBase. It has more detail on versioning than is provided here. As of this writing, the limitation Overwriting values at existing timestamps mentioned in the article no longer holds in HBase. This section is basically a synopsis of this article by Bruno Dumon.
[4] When HBase does a major compaction, the tombstones are processed to actually remove the dead values, together with the tombstones themselves.
[5] HBASE-2256
[6] See Garbage Collection in Bending time in HBase
The HBase client HTable is responsible for finding RegionServers that are serving the particular row range of interest. It does this by querying the .META. and -ROOT- catalog tables (TODO: Explain). After locating the required region(s), the client directly contacts the RegionServer serving that region (i.e., it does not go through the master) and issues the read or write request. This information is cached in the client so that subsequent requests need not go through the lookup process. Should a region be reassigned either by the master load balancer or because a RegionServer has died, the client will requery the catalog tables to determine the new location of the user region.
Administrative functions are handled through HBaseAdmin.
For connection configuration information, see ???.
HTable instances are not thread-safe. When creating HTable instances, it is advisable to use the same HBaseConfiguration instance. This will ensure sharing of ZooKeeper and socket instances to the RegionServers which is usually what you want. For example, this is preferred:
HBaseConfiguration conf = HBaseConfiguration.create();
HTable table1 = new HTable(conf, "myTable");
HTable table2 = new HTable(conf, "myTable");
as opposed to this:
HBaseConfiguration conf1 = HBaseConfiguration.create();
HTable table1 = new HTable(conf1, "myTable");
HBaseConfiguration conf2 = HBaseConfiguration.create();
HTable table2 = new HTable(conf2, "myTable");
For more information about how connections are handled in the HBase client, see HConnectionManager.
For applications which require high-end multithreaded access (e.g., web-servers or application servers that may serve many application threads in a single JVM), see HTablePool.
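A minimal sketch of pool usage, assuming the 0.90-era HTablePool API (getTable/putTable); the pool size and names are illustrative:
Configuration conf = HBaseConfiguration.create();
HTablePool pool = new HTablePool(conf, 10); // share up to 10 HTable instances per table name

HTableInterface table = pool.getTable("myTable");
try {
  Put put = new Put(Bytes.toBytes("row1"));
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes("value"));
  table.put(put);
} finally {
  pool.putTable(table); // return the instance to the pool rather than closing it
}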
If ??? is turned off on HTable, Puts are sent to RegionServers when the writebuffer is filled. The writebuffer is 2MB by default. Before an HTable instance is discarded, either close() or flushCommits() should be invoked so Puts will not be lost.
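Putting the pieces together, a buffered-write loop might look like the following sketch; the table, family, and qualifier names and the sizes are illustrative:
HTable htable = new HTable(HBaseConfiguration.create(), "myTable");
htable.setAutoFlush(false);                  // buffer Puts on the client side
htable.setWriteBufferSize(12 * 1024 * 1024); // e.g., 12MB instead of the 2MB default
for (int i = 0; i < 100000; i++) {
  Put put = new Put(Bytes.toBytes("row-" + i));
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes(i));
  htable.put(put);                           // actually sent when the writebuffer fills
}
htable.flushCommits();                       // push anything still sitting in the buffer
htable.close();                              // close() also flushes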
Note: htable.delete(Delete); does not go in the writebuffer! This only applies to Puts.
For additional information on write durability, review the ACID semantics page.
For fine-grained control of batching of Puts or Deletes, see the batch methods on HTable.
Get and Scan instances can be optionally configured with filters which are applied on the RegionServer.
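From the Java client, a filter is simply constructed and set on the Get or Scan; for example, a row-prefix scan might look like this (the prefix and table references are made up):
Scan scan = new Scan();
scan.setFilter(new PrefixFilter(Bytes.toBytes("row2"))); // evaluated on the RegionServer
ResultScanner scanner = htable.getScanner(scan);
try {
  for (Result result : scanner) {
    // process each matching row
  }
} finally {
  scanner.close();
}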
This allows the user to perform server-side filtering when accessing HBase over Thrift. The user specifies a filter via a string; the string is parsed on the server to construct the filter.
A simple filter expression is expressed as: “FilterName (argument, argument, ... , argument)”
You must specify the name of the filter followed by the argument list in parentheses. Commas separate the individual arguments.
If the argument represents a string, it should be enclosed in single quotes.
If it represents a boolean, an integer or a comparison operator like <, >, != etc. it should not be enclosed in quotes
The filter name must be one word. All ASCII characters are allowed except for whitespace, single quotes and parentheses.
The filter’s arguments can contain any ASCII character. If single quotes are present in the argument, they must be escaped by a preceding single quote.
Currently, two binary operators (AND, OR) and two unary operators (SKIP, WHILE) are supported.
Note: the operators are all in uppercase
AND – as the name suggests, if this operator is used, the key-value must pass both the filters
OR – as the name suggests, if this operator is used, the key-value must pass at least one of the filters
SKIP – For a particular row, if any of the key-values don’t pass the filter condition, the entire row is skipped
WHILE - For a particular row, it continues to emit key-values until a key-value is reached that fails the filter condition
Compound Filters: Using these operators, a hierarchy of filters can be created. For example: “(Filter1 AND Filter2) OR (Filter3 AND Filter4)”
Parentheses have the highest precedence. The SKIP and WHILE operators are next and have the same precedence. The AND operator has the next highest precedence, followed by the OR operator.
For example:
A filter string of the form:“Filter1 AND Filter2 OR Filter3”
will be evaluated as:“(Filter1 AND Filter2) OR Filter3”
A filter string of the form:“Filter1 AND SKIP Filter2 OR Filter3”
will be evaluated as:“(Filter1 AND (SKIP Filter2)) OR Filter3”
A compare operator can be any of the following:
LESS (<)
LESS_OR_EQUAL (<=)
EQUAL (=)
NOT_EQUAL (!=)
GREATER_OR_EQUAL (>=)
GREATER (>)
NO_OP (no operation)
The client should use the symbols (<, <=, =, !=, >, >=) to express compare operators.
A comparator can be any of the following:
BinaryComparator - This lexicographically compares against the specified byte array using Bytes.compareTo(byte[], byte[])
BinaryPrefixComparator - This lexicographically compares against a specified byte array. It only compares up to the length of this byte array.
RegexStringComparator - This compares against the specified byte array using the given regular expression. Only EQUAL and NOT_EQUAL comparisons are valid with this comparator
SubStringComparator - This tests if the given substring appears in a specified byte array. The comparison is case insensitive. Only EQUAL and NOT_EQUAL comparisons are valid with this comparator
The general syntax of a comparator is: ComparatorType:ComparatorValue
The ComparatorType for the various comparators is as follows:
BinaryComparator - binary
BinaryPrefixComparator - binaryprefix
RegexStringComparator - regexstring
SubStringComparator - substring
The ComparatorValue can be any value.
Example1: >, 'binary:abc' will match everything that is lexicographically greater than "abc"
Example2: =, 'binaryprefix:abc' will match everything whose first 3 characters are lexicographically equal to "abc"
Example3: !=, 'regexstring:ab*yz' will match everything that doesn't begin with "ab" and ends with "yz"
Example4: =, 'substring:abc123' will match everything that begins with the substring "abc123"
<?php
$_SERVER['PHP_ROOT'] = realpath(dirname(__FILE__).'/..');
require_once $_SERVER['PHP_ROOT'].'/flib/__flib.php';
flib_init(FLIB_CONTEXT_SCRIPT);
require_module('storage/hbase');
$hbase = new HBase('<server_name_running_thrift_server>', <port on which thrift server is running>);
$hbase->open();
$client = $hbase->getClient();
$result = $client->scannerOpenWithFilterString('table_name', "(PrefixFilter ('row2') AND (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter ( 123, 456))");
$to_print = $client->scannerGetList($result,1);
while ($to_print) {
print_r($to_print);
$to_print = $client->scannerGetList($result,1);
}
$client->scannerClose($result);
?>
“PrefixFilter (‘Row’) AND PageFilter (1) AND FirstKeyOnlyFilter ()” will return all key-value pairs that match the following conditions:
1) The row containing the key-value should have prefix “Row”
2) The key-value must be located in the first row of the table
3) The key-value pair must be the first key-value in the row
“(RowFilter (=, ‘binary:Row 1’) AND TimeStampsFilter (74689, 89734)) OR ColumnRangeFilter (‘abc’, true, ‘xyz’, false)” will return all key-value pairs that match both the following conditions:
1) The key-value is in a row having row key “Row 1”
2) The key-value must have a timestamp of either 74689 or 89734.
Or it must match the following condition:
1) The key-value pair must be in a column that is lexicographically >= abc and < xyz
“SKIP ValueFilter (0)” will skip the entire row if any of the values in the row is not 0
KeyOnlyFilter
Description: This filter doesn’t take any arguments. It returns only the key component of each key-value.
Syntax: KeyOnlyFilter ()
Example: "KeyOnlyFilter ()"
FirstKeyOnlyFilter
Description: This filter doesn’t take any arguments. It returns only the first key-value from each row.
Syntax: FirstKeyOnlyFilter ()
Example: "FirstKeyOnlyFilter ()"
PrefixFilter
Description: This filter takes one argument – a prefix of a row key. It returns only those key-values present in a row that starts with the specified row prefix
Syntax: PrefixFilter (‘<row_prefix>’)
Example: "PrefixFilter (‘Row’)"
ColumnPrefixFilter
Description: This filter takes one argument
– a column prefix. It returns only those key-values present in a column that starts
with the specified column prefix. The column prefix must be of the form: “qualifier”
Syntax:ColumnPrefixFilter(‘<column_prefix>’)
Example: "ColumnPrefixFilter(‘Col’)"
MultipleColumnPrefixFilter
Description: This filter takes a list of
column prefixes. It returns key-values that are present in a column that starts with
any of the specified column prefixes. Each of the column prefixes must be of the form: “qualifier”
Syntax:MultipleColumnPrefixFilter(‘<column_prefix>’, ‘<column_prefix>’, …, ‘<column_prefix>’)
Example: "MultipleColumnPrefixFilter(‘Col1’, ‘Col2’)"
ColumnCountGetFilter
Description: This filter takes one argument – a limit. It returns the first limit number of columns in the table
Syntax: ColumnCountGetFilter (‘<limit>’)
Example: "ColumnCountGetFilter (4)"
PageFilter
Description: This filter takes one argument – a page size. It returns page size number of rows from the table.
Syntax: PageFilter (‘<page_size>’)
Example: "PageFilter (2)"
ColumnPaginationFilter
Description: This filter takes two arguments – a limit and offset. It returns limit number of columns after offset number of columns. It does this for all the rows
Syntax: ColumnPaginationFilter(‘<limit>’, ‘<offset>’)
Example: "ColumnPaginationFilter (3, 5)"
InclusiveStopFilter
Description: This filter takes one argument – a row key on which to stop scanning. It returns all key-values present in rows up to and including the specified row
Syntax: InclusiveStopFilter(‘<stop_row_key>’)
Example: "InclusiveStopFilter (‘Row2’)”
TimeStampsFilter
Description: This filter takes a list of timestamps. It returns those key-values whose timestamp matches any of the specified timestamps
Syntax: TimeStampsFilter (<timestamp>, <timestamp>, ... ,<timestamp>)
Example: "TimeStampsFilter (5985489, 48895495, 58489845945)"
RowFilter
Description: This filter takes a compare operator and a comparator. It compares each row key with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that row
Syntax: RowFilter (<compareOp>, ‘<row_comparator>’)
Example: "RowFilter (<=, ‘xyz)"
FamilyFilter
Description: This filter takes a compare operator and a comparator. It compares each column family name with the comparator using the compare operator and, if the comparison returns true, it returns all the key-values in that column family
Syntax: FamilyFilter (<compareOp>, ‘<family_comparator>’)
Example: "FamilyFilter (=, ‘FamilyA’)"
QualifierFilter
Description: This filter takes a compare operator and a comparator. It compares each qualifier name with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that column
Syntax: QualifierFilter (<compareOp>,‘<qualifier_comparator>’)
Example: "QualifierFilter (=,‘Column1’)"
ValueFilter
Description: This filter takes a compare operator and a comparator. It compares each value with the comparator using the compare operator and if the comparison returns true, it returns that key-value
Syntax: ValueFilter (<compareOp>,‘<value_comparator>’)
Example: "ValueFilter (!=, ‘Value’)"
DependentColumnFilter
Description: This filter takes two arguments – a family and a qualifier. It tries to locate this column in each row and returns all key-values in that row that have the same timestamp. If the row doesn’t contain the specified column – none of the key-values in that row will be returned.
The filter can also take an optional boolean argument – dropDependentColumn. If set to true, the column we were depending on doesn’t get returned.
The filter can also take two more additional optional arguments – a compare operator and a value comparator, which are further checks in addition to the family and qualifier. If the dependent column is found, its value should also pass the value check and then only is its timestamp taken into consideration
Syntax: DependentColumnFilter (‘<family>’, ‘<qualifier>’, <boolean>, <compare operator>, ‘<value comparator>’)
Syntax: DependentColumnFilter (‘<family>’, ‘<qualifier>’, <boolean>)
Syntax: DependentColumnFilter (‘<family>’, ‘<qualifier>’)
Example: "DependentColumnFilter (‘conf’, ‘blacklist’, false, >=, ‘zebra’)"
Example: "DependentColumnFilter (‘conf’, 'blacklist', true)"
Example: "DependentColumnFilter (‘conf’, 'blacklist')"
SingleColumnValueFilter
Description: This filter takes a column family, a qualifier, a compare operator and a comparator. If the specified column is not found – all the columns of that row will be emitted. If the column is found and the comparison with the comparator returns true, all the columns of the row will be emitted. If the condition fails, the row will not be emitted.
This filter also takes two additional optional boolean arguments – filterIfColumnMissing and setLatestVersionOnly
If the filterIfColumnMissing flag is set to true the columns of the row will not be emitted if the specified column to check is not found in the row. The default value is false.
If the setLatestVersionOnly flag is set to false, it will test previous versions (timestamps) too. The default value is true.
These flags are optional, but if used you must set both of them (you cannot set just one).
Syntax: SingleColumnValueFilter(<compare operator>, ‘<comparator>’, ‘<family>’, ‘<qualifier>’,<filterIfColumnMissing_boolean>, <latest_version_boolean>)
Syntax: SingleColumnValueFilter(<compare operator>, ‘<comparator>’, ‘<family>’, ‘<qualifier>’)
Example: "SingleColumnValueFilter (<=, ‘abc’,‘FamilyA’, ‘Column1’, true, false)"
Example: "SingleColumnValueFilter (<=, ‘abc’,‘FamilyA’, ‘Column1’)"
SingleColumnValueExcludeFilter
Description: This filter takes the same arguments and behaves same as SingleColumnValueFilter – however, if the column is found and the condition passes, all the columns of the row will be emitted except for the tested column value.
Syntax: SingleColumnValueExcludeFilter(<compare operator>, '<comparator>', '<family>', '<qualifier>',<latest_version_boolean>, <filterIfColumnMissing_boolean>)
Syntax: SingleColumnValueExcludeFilter(<compare operator>, '<comparator>', '<family>', '<qualifier>')
Example: "SingleColumnValueExcludeFilter (‘<=’, ‘abc’,‘FamilyA’, ‘Column1’, ‘false’, ‘true’)"
Example: "SingleColumnValueExcludeFilter (‘<=’, ‘abc’, ‘FamilyA’, ‘Column1’)"
ColumnRangeFilter
Description: This filter is used for selecting only those keys with columns that are between minColumn and maxColumn. It also takes two boolean variables to indicate whether to include the minColumn and maxColumn or not.
If you don’t want to set the minColumn or the maxColumn – you can pass in an empty argument.
Syntax: ColumnRangeFilter (‘<minColumn>’, <minColumnInclusive_bool>, ‘<maxColumn>’, <maxColumnInclusive_bool>)
Example: "ColumnRangeFilter (‘abc’, true, ‘xyz’, false)"
HMaster is the implementation of the Master Server. The Master server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes.
If run in a multi-Master environment, all Masters compete to run the cluster. If the active Master loses its lease in ZooKeeper (or the Master shuts down), then the remaining Masters jostle to take over the Master role.
The methods exposed by HMasterInterface are primarily metadata-oriented methods:
For example, when the HBaseAdmin method disableTable is invoked, it is serviced by the Master server.
HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions.
The methods exposed by HRegionInterface contain both data-oriented and region-maintenance methods:
For example, when the HBaseAdmin method majorCompact is invoked on a table, the client is actually iterating through all regions for the specified table and requesting a major compaction directly to each region.
The RegionServer runs a variety of background threads:
CompactSplitThread checks for splits and handles minor compactions.
MajorCompactionChecker checks for major compactions.
MemStoreFlusher periodically flushes in-memory writes in the MemStore to StoreFiles.
LogRoller periodically checks the RegionServer's HLog.
This chapter is all about Regions.
Regions are comprised of a Store per Column Family.
Region size is one of those tricky things; there are a few factors to consider:
Regions are the basic element of availability and distribution.
HBase scales by having regions across many servers. Thus if you have 2 regions for 16GB of data on a 20-node cluster, you are wasting most of the cluster.
High region count has been known to make things slow. This is getting better, but it is probably better to have 700 regions than 3000 for the same amount of data.
Low region count prevents parallel scalability as per point #2. This really can't be stressed enough, since a common problem is loading 200MB of data into HBase and then wondering why your awesome 10-node cluster is mostly idle.
There is not much memory footprint difference between 1 region and 10 in terms of indexes, etc, held by the RegionServer.
It's probably best to stick to the default, perhaps going smaller for hot tables (or manually splitting hot regions to spread the load over the cluster), or going with a 1GB region size if your cell sizes tend to be largish (100k and up).
Splits run unaided on the RegionServer; i.e., the Master does not participate. The RegionServer splits a region, offlines the split region, adds the daughter regions to META, opens the daughters on the parent's hosting RegionServer, and then reports the split to the Master. See ??? for how to manually manage splits (and for why you might do this).
Periodically, and when there are not any regions in transition, a load balancer will run and move regions around to balance cluster load. The period at which it runs can be configured.
A Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region.
The MemStore holds in-memory modifications to the Store. Modifications are KeyValues. When asked to flush, the current MemStore is moved to a snapshot and is cleared. HBase continues to serve edits out of the new MemStore and the backing snapshot until the flusher reports that the flush succeeded. At this point the snapshot is discarded.
The hfile file format is based on the SSTable file described in the BigTable [2006] paper and on Hadoop's tfile (The unit test suite and the compression harness were taken directly from tfile). Schubert Zhang's blog post on HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs makes for a thorough introduction to HBase's hfile. Matteo Bertozzi has also put up a helpful description, HBase I/O: HFile.
To view a textualized version of hfile content, you can use the org.apache.hadoop.hbase.io.hfile.HFile tool. Type the following to see usage:

$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile

For example, to view the content of the file hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475, type the following:

$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -v -f hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475

Leave off the -v option to see just a summary of the hfile. See usage for other things to do with the HFile tool.
There are two types of compactions: minor and major. Minor compactions will usually pick up a couple of the smaller adjacent files and rewrite them as one. Minors do not drop deletes or expired cells, only major compactions do this. Sometimes a minor compaction will pick up all the files in the store and in this case it actually promotes itself to being a major compaction. For a description of how a minor compaction picks files to compact, see the ascii diagram in the Store source code.
After a major compaction runs there will be a single storefile per store, and this will usually help performance. Caution: major compactions rewrite all of the store's data and, on a loaded system, this may not be tenable; major compactions will usually have to be done manually on large systems. See ???.
The Block Cache contains three levels of block priority to allow for scan-resistance and in-memory ColumnFamilies. A block is added with an in-memory flag if the containing ColumnFamily is defined in-memory, otherwise a block becomes a single access priority. Once a block is accessed again, it changes to multiple access. This is used to prevent scans from thrashing the cache, adding a least-frequently-used element to the eviction algorithm. Blocks from in-memory ColumnFamilies are the last to be evicted.
For more information, see the LruBlockCache source
Each RegionServer adds updates (Puts, Deletes) to its write-ahead log (WAL) first, and then to the MemStore (the section called “MemStore”) for the affected Store (the section called “Store”). This ensures that HBase has durable writes. Without the WAL, there is the possibility of data loss in the case of a RegionServer failure before each MemStore is flushed and new StoreFiles are written. HLog is the HBase WAL implementation, and there is one HLog instance per RegionServer.
The WAL is in HDFS in /hbase/.logs/ with subdirectories per region.
For more general information about the concept of write ahead logs, see the Wikipedia Write-Ahead Log article.
When a RegionServer crashes, it will lose its ephemeral lease in ZooKeeper...TODO
When set to true, the default, any error encountered splitting will be logged, the problematic WAL will be moved into the .corrupt directory under the hbase rootdir, and processing will continue. If set to false, the exception will be propagated and the split logged as failed.[7]

If we get an EOF while splitting logs, we proceed with the split even when hbase.hlog.split.skip.errors == false. An EOF while reading the last log in the set of files to split is near-guaranteed since the RegionServer likely crashed mid-write of a record. But we'll continue even if we got an EOF reading other than the last file in the set.[8]
[7] See HBASE-2958 When hbase.hlog.split.skip.errors is set to false, we fail the split but thats it. We need to do more than just fail split if this flag is set.
[8] For background, see HBASE-2643 Figure how to deal with eof splitting logs
Bloom filters were developed over in HBase-1200 Add bloomfilters.[9][10]
Blooms are enabled by specifying options on a column family in the HBase shell or in Java code as a specification on org.apache.hadoop.hbase.HColumnDescriptor. Use HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) to enable blooms per column family. Default = NONE for no bloom filters. If ROW, the hash of the row will be added to the bloom on each insert. If ROWCOL, the hash of the row + column family + column family qualifier will be added to the bloom on each key insert.
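A hedged Java sketch, assuming the 0.92-era API in which the bloom type is the StoreFile.BloomType enum (the family name is illustrative):
HColumnDescriptor cf = new HColumnDescriptor("cf");
cf.setBloomFilterType(StoreFile.BloomType.ROW); // row-key hashes added to the bloom on each insert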
io.hfile.bloom.enabled in Configuration serves as the kill switch in case something goes wrong. Default = true.

io.hfile.bloom.error.rate = average false positive rate. Default = 1%. Decrease rate by ½ (e.g. to .5%) == +1 bit per bloom entry.

io.hfile.bloom.max.fold = guaranteed minimum fold rate. Most people should leave this alone. Default = 7, or can collapse to at least 1/128th of original size. See the Development Process section of the document BloomFilters in HBase for more on what this option means.
Bloom filters add an entry to the StoreFile general FileInfo data structure and then two extra entries to the StoreFile metadata section.

BLOOM_FILTER_META holds Bloom Size, Hash Function used, etc. It's small in size and is cached on StoreFile.Reader load.
[9] For description of the development process -- why static blooms rather than dynamic -- and for an overview of the unique properties that pertain to blooms in HBase, as well as possible future directions, see the Development Process section of the document BloomFilters in HBase attached to HBase-1200.
[10] The bloom filters described here are actually version two of blooms in HBase. In versions up to 0.19.x, HBase had a dynamic bloom option based on work done by the European Commission One-Lab Project 034819. The core of the HBase bloom work was later pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile. Version 1 of HBase blooms never worked that well. Version 2 is a rewrite from scratch though again it starts with the one-lab work.
Here we list HBase tools for administration, analysis, fixup, and debugging.
To run hbck against your HBase cluster run
$ ./bin/hbase hbck
At the end of the command's output it prints OK or INCONSISTENCY. If your cluster reports inconsistencies, pass -details to see more detail emitted. If inconsistencies, run hbck a few times because the inconsistency may be transient (e.g., the cluster is starting up or a region is splitting). Passing -fix may correct the inconsistency (this latter is an experimental feature).
The main method on HLog offers manual split and dump facilities. Pass it WALs or the product of a split, the content of the recovered.edits directory.
You can get a textual dump of a WAL file content by doing the following:
$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --dump hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012

The return code will be non-zero if there are issues with the file, so you can test the wholesomeness of a file by redirecting STDOUT to /dev/null and testing the program return code.
Similarly, you can force a split of a log file directory by doing:
$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --split hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/

You can stop an individual RegionServer by running the following script in the HBase directory on the particular node:
$ ./bin/hbase-daemon.sh stop regionserver
The RegionServer will first close all regions and then shut itself down. On shutdown, the RegionServer's ephemeral node in ZooKeeper will expire. The master will notice the RegionServer gone and will treat it as a 'crashed' server; it will reassign the nodes the RegionServer was carrying.
If the load balancer runs while a node is shutting down, then there could be contention between the Load Balancer and the Master's recovery of the just decommissioned RegionServer. Avoid any problems by disabling the balancer first. See Load Balancer below.
A downside to the above stop of a RegionServer is that regions could be offline for a good period of time. Regions are closed in order. If there are many regions on the server, the first region to close may not be back online until all regions close and after the master notices the RegionServer's znode gone. In HBase 0.90.2, we added a facility for having a node gradually shed its load and then shut itself down. HBase 0.90.2 added the graceful_stop.sh script. Here is its usage:

$ ./bin/graceful_stop.sh
Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload] [--thrift] [--rest] <hostname>
 thrift      If we should stop/start thrift before/after the hbase stop/start
 rest        If we should stop/start rest before/after the hbase stop/start
 restart     If we should restart after graceful stop
 reload      Move offloaded regions back on to the stopped server
 debug       Move offloaded regions back on to the stopped server
 hostname    Hostname of server we are to stop
To decommission a loaded RegionServer, run the following:
$ ./bin/graceful_stop.sh HOSTNAME
where HOSTNAME is the host carrying the RegionServer you would decommission.
The HOSTNAME passed to graceful_stop.sh must match the hostname that HBase is using to identify RegionServers. Check the list of RegionServers in the master UI for how HBase is referring to servers. It's usually a hostname but can also be an FQDN. Whatever HBase is using, this is what you should pass the graceful_stop.sh decommission script. If you pass IPs, the script is not yet smart enough to make a hostname (or FQDN) of it, so it will fail when it checks if the server is currently running; the graceful unloading of regions will not run.
The graceful_stop.sh script will move the regions off the decommissioned RegionServer one at a time to minimize region churn. It will verify the region deployed in the new location before it moves the next region, and so on, until the decommissioned server is carrying zero regions. At this point, graceful_stop.sh tells the RegionServer to stop. The master will at this point notice the RegionServer gone, but all regions will have already been redeployed, and because the RegionServer went down cleanly, there will be no WAL logs to split.
It is assumed that the Region Load Balancer is disabled while the graceful_stop script runs (otherwise the balancer and the decommission script will end up fighting over region deployments). Use the shell to disable the balancer:
hbase(main):001:0> balance_switch false
true
0 row(s) in 0.3590 seconds
This turns the balancer OFF. To reenable, do:
hbase(main):001:0> balance_switch true
false
0 row(s) in 0.3590 seconds
You can also ask this script to restart a RegionServer after the shutdown AND move its old regions back into place. The latter you might do to retain data locality. A primitive rolling restart might be effected by running something like the following:
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
Tail the output of /tmp/log.txt to follow the script's progress. The above does RegionServers only. Be sure to disable the load balancer before doing the above. You'd need to do the master update separately; do it before you run the above script.
Here is a pseudo-script for how you might craft a rolling restart script:
Untar your release, make sure of its configuration and then rsync it across the cluster. If this is 0.90.2, patch it with HBASE-3744 and HBASE-3756.
Run hbck to ensure the cluster is consistent:
$ ./bin/hbase hbck
Effect repairs if inconsistent.
Restart the Master:
$ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master
Disable the region balancer:
$ echo "balance_switch false" | ./bin/hbase
Run the graceful_stop.sh script per RegionServer. For example:
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
If you are running thrift or rest servers on the RegionServer, pass --thrift or --rest options (see usage for the graceful_stop.sh script).
Restart the Master again. This will clear out dead servers list and reenable the balancer.
Run hbck to ensure the cluster is consistent.
CopyTable is a utility that can copy part of or all of a table, either to the same cluster or to another cluster. The usage is as follows:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--rs.class=CLASS] [--rs.impl=IMPL] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] tablename
Options:
 rs.class     hbase.regionserver.class of the peer cluster. Specify if different from current cluster.
 rs.impl      hbase.regionserver.impl of the peer cluster.
 starttime    Beginning of the time range. Without endtime means starttime to forever.
 endtime      End of the time range. Without endtime means starttime to forever.
 new.name     New table's name.
 peer.adr     Address of the peer cluster given in the format hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
 families     Comma-separated list of ColumnFamilies to copy.
Args:
Example of copying 'TestTable' to a cluster that uses replication for a 1 hour window:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --rs.class=org.apache.hadoop.hbase.ipc.ReplicationRegionInterface --rs.impl=org.apache.hadoop.hbase.regionserver.replication.ReplicationRegionServer --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase TestTable
HBase includes a tool to test that compression is set up properly.
To run it, type ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest.
This will emit usage on how to run the tool.
To have a RegionServer test a set of codecs and fail to start if any codec is missing or misinstalled, add the configuration hbase.regionserver.codecs to your hbase-site.xml with a value of codecs to test on startup. For example, if the hbase.regionserver.codecs value is lzo,gz and if lzo is not present or improperly installed, the misconfigured RegionServer will fail to start.
Administrators might make use of this facility to guard against the case where a new server is added to the cluster but the cluster requires installation of a particular codec.
Unfortunately, HBase cannot ship with LZO because of the licensing issues; HBase is Apache-licensed, LZO is GPL. Therefore LZO install is to be done post-HBase install. See the Using LZO Compression wiki page for how to make LZO work with HBase.
A common problem users run into when using LZO is that while the initial setup of the cluster runs smoothly, a month goes by and some sysadmin adds a machine to the cluster, having forgotten to do the LZO setup on the new machine. In versions since HBase 0.90.0, we should fail in a way that makes it plain what the problem is, but maybe not.
See the section called “hbase.regionserver.codecs” for a feature to help protect against failed LZO install.
GZIP will generally compress better than LZO, though it is slower. For some setups, better compression may be preferred. HBase will use Java's GZIP unless the native Hadoop libraries are available on the CLASSPATH, in which case it will use native compressors instead. (If the native libs are NOT present, you will see lots of Got brand-new compressor reports in your logs; see the related FAQ entry.)
If snappy is installed, HBase can make use of it (courtesy of hadoop-snappy [11]).
Build and install snappy on all nodes of your cluster.
Use CompressionTest to verify snappy support is enabled and the libs can be loaded ON ALL NODES of your cluster:
$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy
Create a column family with snappy compression and verify it in the hbase shell:
hbase> create 't1', { NAME => 'cf1', COMPRESSION => 'SNAPPY' }
hbase> describe 't1'
In the output of the "describe" command, ensure that it lists COMPRESSION => 'SNAPPY'.
C.1. General

Are there other HBase FAQs?

See the FAQ that is up on the wiki, HBase Wiki FAQ.

Does HBase support SQL?

Not really. SQL-ish support for HBase via Hive is in development; however, Hive is based on MapReduce, which is not generally suitable for low-latency requests. See the Chapter 5, Data Model section for examples of using the HBase client.

How does HBase work on top of HDFS?

HDFS is a distributed file system that is well suited to the storage of large files. Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. See the Chapter 5, Data Model and Chapter 6, Architecture sections for more information on how HBase achieves its goals.

Can I change a table's rowkeys?

Why are logs flooded with '2011-01-10 12:40:48,407 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor' messages?

Because we are not using the native versions of the compression libraries. See HBASE-1900 Put back native support when hadoop 0.21 is released. Copy the native libs from Hadoop under the HBase lib directory, or symlink them into place, and the message should go away.

C.2. EC2

Why doesn't my remote Java connection into my EC2 cluster work?

See Andrew's answer here, up on the user list: Remote Java client connection into EC2 instance.

C.3. Building HBase

When I build, why do I always get

Ignore it. It's not an error. It is officially ugly, though.

C.4. Runtime

I'm having problems with my HBase cluster; how can I troubleshoot it?

See ???.

How can I improve HBase cluster performance?

See ???.

C.5. How do I...?

Secondary Indexes in HBase?

See the section called “Secondary Indexes and Alternate Query Paths”.

Store (fill in the blank) in HBase?

Back up my HBase Cluster?

See HBase Backup Options over on the Sematext Blog.
TODO: Describe how YCSB is poor for putting up a decent cluster load.
TODO: Describe setup of YCSB for HBase
Ted Dunning redid YCSB so that it is Mavenized and added a facility for verifying workloads. See Ted Dunning's YCSB.
Table of Contents
We found it necessary to revise the HFile format after encountering high memory usage and slow startup times caused by large Bloom filters and block indexes in the region server. Bloom filters can get as large as 100 MB per HFile, which adds up to 2 GB when aggregated over 20 regions. Block indexes can grow as large as 6 GB in aggregate size over the same set of regions. A region is not considered opened until all of its block index data is loaded. Large Bloom filters produce a different performance problem: the first get request that requires a Bloom filter lookup will incur the latency of loading the entire Bloom filter bit array.
To speed up region server startup we break Bloom filters and block indexes into multiple blocks and write those blocks out as they fill up, which also reduces the HFile writer’s memory footprint. In the Bloom filter case, “filling up a block” means accumulating enough keys to efficiently utilize a fixed-size bit array, and in the block index case we accumulate an “index block” of the desired size. Bloom filter blocks and index blocks (we call these “inline blocks”) become interspersed with data blocks, and as a side effect we can no longer rely on the difference between block offsets to determine data block length, as it was done in version 1.
HFile is a low-level file format by design, and it should not deal with application-specific details such as Bloom filters, which are handled at StoreFile level. Therefore, we call Bloom filter blocks in an HFile "inline" blocks. We also supply HFile with an interface to write those inline blocks.
Another format modification aimed at reducing the region server startup time is to use a contiguous “load-on-open” section that has to be loaded in memory at the time an HFile is being opened. Currently, as an HFile opens, there are separate seek operations to read the trailer, data/meta indexes, and file info. To read the Bloom filter, there are two more seek operations for its “data” and “meta” portions. In version 2, we seek once to read the trailer and seek again to read everything else we need to open the file from a contiguous block.
As we will be discussing the changes we are making to the HFile format, it is useful to give a short overview of the previous (HFile version 1) format. An HFile in the existing format is structured as follows:
The block index in version 1 is very straightforward. For each entry, it contains:
Offset (long)
Uncompressed size (int)
Key (a serialized byte array written using Bytes.writeByteArray)
Key length as a variable-length integer (VInt)
Key bytes
The number of entries in the block index is stored in the fixed file trailer, and has to be passed in to the method that reads the block index. One of the limitations of the block index in version 1 is that it does not provide the compressed size of a block, which turns out to be necessary for decompression. Therefore, the HFile reader has to infer this compressed size from the offset difference between blocks. We fix this limitation in version 2, where we store on-disk block size instead of uncompressed size, and get uncompressed size from the block header.
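To make that limitation concrete, a version 1 reader can only recover a block's on-disk (compressed) size from neighboring offsets, roughly as in the following sketch (names are illustrative, not the HBase implementation):

// Sketch: inferring on-disk block sizes in HFile version 1 from the block
// index offsets. blockOffsets holds the index entries' offsets in file order;
// endOfDataOffset is where the data section ends.
static long[] inferOnDiskSizes(long[] blockOffsets, long endOfDataOffset) {
    long[] sizes = new long[blockOffsets.length];
    for (int i = 0; i < blockOffsets.length; i++) {
        long nextStart = (i + 1 < blockOffsets.length)
            ? blockOffsets[i + 1]
            : endOfDataOffset;              // the last block runs to the end of the data section
        sizes[i] = nextStart - blockOffsets[i];
    }
    return sizes;
}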
The version of HBase introducing the above features reads both version 1 and 2 HFiles, but only writes version 2 HFiles. A version 2 HFile is structured as follows:
In version 2, every block in the data section contains the following fields:
8 bytes: Block type, a sequence of bytes equivalent to version 1's "magic records". Supported block types are:
DATA – data blocks
LEAF_INDEX – leaf-level index blocks in a multi-level block index
BLOOM_CHUNK – Bloom filter chunks
META – meta blocks (not used for Bloom filters in version 2 anymore)
INTERMEDIATE_INDEX – intermediate-level index blocks in a multi-level block index
ROOT_INDEX – root-level index blocks in a multi-level block index
FILE_INFO – the “file info” block, a small key-value map of metadata
BLOOM_META – a Bloom filter metadata block in the load-on-open section
TRAILER – a fixed-size file trailer. As opposed to the above, this is not an HFile v2 block but a fixed-size (for each HFile version) data structure
INDEX_V1 – this block type is only used for legacy HFile v1 blocks
Compressed size of the block's data, not including the header (int).
Can be used for skipping the current data block when scanning HFile data.
Uncompressed size of the block's data, not including the header (int)
This is equal to the compressed size if the compression algorithm is NONE
File offset of the previous block of the same type (long)
Can be used for seeking to the previous data/index block
Compressed data (or uncompressed data if the compression algorithm is NONE).
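As a rough sketch of how such a block header could be parsed (field names here are illustrative; this is not the HBase implementation), the fixed fields described above can be read in order from a DataInputStream:

import java.io.DataInputStream;
import java.io.IOException;

// Minimal sketch of the fixed header fields of an HFile version 2 block,
// read in the order described above. Illustrative only.
public class HFileV2BlockHeader {
    public static final int MAGIC_LENGTH = 8;                 // 8-byte block type "magic record"
    public final byte[] blockTypeMagic = new byte[MAGIC_LENGTH];
    public int onDiskSizeWithoutHeader;                        // compressed size, header excluded
    public int uncompressedSizeWithoutHeader;                  // equals on-disk size when compression is NONE
    public long prevBlockOffset;                               // offset of previous block of the same type

    public static HFileV2BlockHeader readFrom(DataInputStream in) throws IOException {
        HFileV2BlockHeader header = new HFileV2BlockHeader();
        in.readFully(header.blockTypeMagic);                   // block type
        header.onDiskSizeWithoutHeader = in.readInt();         // compressed size (int)
        header.uncompressedSizeWithoutHeader = in.readInt();   // uncompressed size (int)
        header.prevBlockOffset = in.readLong();                // previous block of same type (long)
        return header;                                         // the block's (possibly compressed) data follows
    }
}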
The above format of blocks is used in the following HFile sections:
Scanned block section. This section is so named because it contains all data blocks that need to be read when an HFile is scanned sequentially. It also contains leaf-level index blocks and Bloom chunk blocks.
Non-scanned block section. This section still contains unified-format v2 blocks but it does not have to be read when doing a sequential scan. This section contains “meta” blocks and intermediate-level index blocks.
We are supporting “meta” blocks in version 2 the same way they were supported in version 1, even though we do not store Bloom filter data in these blocks anymore.
There are three types of block indexes in HFile version 2, stored in two different formats (root and non-root):
Data index — version 2 multi-level block index, consisting of:
Version 2 root index, stored in the data block index section of the file
Optionally, version 2 intermediate levels, stored in the non-root format in the data index section of the file. Intermediate levels can only be present if leaf level blocks are present
Optionally, version 2 leaf levels, stored in the non-root format inline with data blocks
Meta index — version 2 root index format only, stored in the meta index section of the file
Bloom index — version 2 root index format only, stored in the “load-on-open” section as part of Bloom filter metadata.
This format applies to:
Root level of the version 2 data index
Entire meta and Bloom indexes in version 2, which are always single-level.
A version 2 root index block is a sequence of entries of the following format, similar to entries of a version 1 block index, but storing on-disk size instead of uncompressed size.
Offset (long)
This offset may point to a data block or to a deeper-level index block.
On-disk size (int)
Key (a serialized byte array stored using Bytes.writeByteArray)
Key length (VInt)
Key bytes
A single-level version 2 block index consists of just a single root index block. To read a root index block of version 2, one needs to know the number of entries. For the data index and the meta index the number of entries is stored in the trailer, and for the Bloom index it is stored in the compound Bloom filter metadata.
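A sketch of reading such a root index block, given the entry count obtained as described above, might look like the following (class and field names are illustrative, not the HBase implementation):

import java.io.DataInputStream;
import java.io.IOException;
import org.apache.hadoop.io.WritableUtils;

// Illustrative sketch: one entry of a version 2 root index block.
class RootIndexEntry {
    long offset;      // may point to a data block or to a deeper-level index block
    int onDiskSize;   // on-disk size of the referenced block
    byte[] key;       // written with Bytes.writeByteArray: VInt length followed by key bytes
}

class RootIndexReader {
    // numEntries comes from the trailer (data/meta index) or from the
    // compound Bloom filter metadata (Bloom index).
    static RootIndexEntry[] read(DataInputStream in, int numEntries) throws IOException {
        RootIndexEntry[] entries = new RootIndexEntry[numEntries];
        for (int i = 0; i < numEntries; i++) {
            RootIndexEntry entry = new RootIndexEntry();
            entry.offset = in.readLong();                  // Offset (long)
            entry.onDiskSize = in.readInt();               // On-disk size (int)
            int keyLength = WritableUtils.readVInt(in);    // Key length (VInt)
            entry.key = new byte[keyLength];
            in.readFully(entry.key);                       // Key bytes
            entries[i] = entry;
        }
        return entries;
    }
}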
For a multi-level block index we also store the following fields in the root index block in the load-on-open section of the HFile, in addition to the data structure described above:
Middle leaf index block offset
Middle leaf block on-disk size (meaning the leaf index block containing the reference to the “middle” data block of the file)
The index of the mid-key (defined below) in the middle leaf-level block.
These additional fields are used to efficiently retrieve the mid-key of the HFile used in HFile splits, which we define as the first key of the block with a zero-based index of (n – 1) / 2, if the total number of blocks in the HFile is n. This definition is consistent with how the mid-key was determined in HFile version 1, and is reasonable in general, because blocks are likely to be the same size on average, but we don’t have any estimates on individual key/value pair sizes.
When writing a version 2 HFile, the total number of data blocks pointed to by every leaf-level index block is kept track of. When we finish writing and the total number of leaf-level blocks is determined, it is clear which leaf-level block contains the mid-key, and the fields listed above are computed. When reading the HFile and the mid-key is requested, we retrieve the middle leaf index block (potentially from the block cache) and get the mid-key value from the appropriate position inside that leaf block.
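Under the definition above, the zero-based index of the data block whose first key is the mid-key is a simple computation; the following trivial sketch is only meant to make the arithmetic concrete:

// For an HFile with n data blocks, the mid-key is the first key of this block.
static int midKeyBlockIndex(int n) {
    return (n - 1) / 2;    // e.g. n = 7 (or n = 8) gives block index 3
}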
This format applies to intermediate-level and leaf index blocks of a version 2 multi-level data block index. Every non-root index block is structured as follows.
numEntries: the number of entries (int).
entryOffsets: the “secondary index” of offsets of entries in the block, to facilitate a quick binary search on the key (numEntries + 1 int values). The last value is the total length of all entries in this index block. For example, in a non-root index block with entry sizes 60, 80, 50 the “secondary index” will contain the following int array: {0, 60, 140, 190}.
Entries. Each entry contains:
Offset of the block referenced by this entry in the file (long)
On-disk size of the referenced block (int)
Key. The length can be calculated from entryOffsets.
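The “secondary index” makes it cheap to locate an individual entry. The following sketch (illustrative names, not the HBase code) shows how entry boundaries fall out of entryOffsets, using the example array {0, 60, 140, 190} from above:

// Illustrative sketch of a non-root index block and its secondary index.
class NonRootIndexBlock {
    int numEntries;       // number of entries in this index block
    int[] entryOffsets;   // numEntries + 1 values; the last one is the total length of all entries
    byte[] entriesBytes;  // the concatenated entries (offset, on-disk size, key)

    // Starting position of entry k within entriesBytes.
    int entryStart(int k) {
        return entryOffsets[k];
    }

    // Byte length of entry k; with entryOffsets {0, 60, 140, 190},
    // entryLength(1) == 140 - 60 == 80.
    int entryLength(int k) {
        return entryOffsets[k + 1] - entryOffsets[k];
    }
}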
In contrast with version 1, in a version 2 HFile Bloom filter metadata is stored in the load-on-open section of the HFile for quick startup.
It describes a compound Bloom filter and contains the following fields:
Bloom filter version = 3 (int). There used to be a DynamicByteBloomFilter class that had the Bloom filter version number 2.
The total byte size of all compound Bloom filter chunks (long)
Number of hash functions (int)
Type of hash functions (int)
The total key count inserted into the Bloom filter (long)
The maximum total number of keys in the Bloom filter (long)
The number of chunks (int)
Comparator class used for Bloom filter keys, a UTF-8 encoded string stored using Bytes.writeByteArray
Bloom block index in the version 2 root block index format
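Read in the order listed, the Bloom filter metadata could be modeled roughly as follows (a sketch with illustrative names; the real StoreFile-level code differs):

import java.io.DataInputStream;
import java.io.IOException;
import org.apache.hadoop.io.WritableUtils;

// Sketch of the compound Bloom filter metadata fields in the load-on-open section.
class CompoundBloomFilterMeta {
    int version;                // Bloom filter version, = 3
    long totalByteSize;         // total byte size of all compound Bloom filter chunks
    int hashCount;              // number of hash functions
    int hashType;               // type of hash functions
    long keyCount;              // total key count inserted into the Bloom filter
    long maxKeys;               // maximum total number of keys in the Bloom filter
    int numChunks;              // number of chunks
    byte[] comparatorClassName; // UTF-8 string written with Bytes.writeByteArray

    static CompoundBloomFilterMeta readFrom(DataInputStream in) throws IOException {
        CompoundBloomFilterMeta meta = new CompoundBloomFilterMeta();
        meta.version = in.readInt();
        meta.totalByteSize = in.readLong();
        meta.hashCount = in.readInt();
        meta.hashType = in.readInt();
        meta.keyCount = in.readLong();
        meta.maxKeys = in.readLong();
        meta.numChunks = in.readInt();
        int length = WritableUtils.readVInt(in);   // comparator class name length
        meta.comparatorClassName = new byte[length];
        in.readFully(meta.comparatorClassName);
        // A Bloom block index in the version 2 root block index format follows.
        return meta;
    }
}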
The file info block is a serialized HbaseMapWritable (essentially a map from byte arrays to byte arrays) with the following keys, among others. StoreFile-level logic adds more keys to this.
| Key | Description |
|---|---|
| hfile.LASTKEY | The last key of the file (byte array) |
| hfile.AVG_KEY_LEN | The average key length in the file (int) |
| hfile.AVG_VALUE_LEN | The average value length in the file (int) |
The file info format did not change in version 2. However, we moved the file info to the final section of the file, which can be loaded as one block at the time the HFile is being opened. Also, we no longer store the comparator in the version 2 file info; instead, we store it in the fixed file trailer. This is because we need to know the comparator at the time of parsing the load-on-open section of the HFile.
The following table shows common and different fields between fixed file trailers in versions 1 and 2. Note that the size of the trailer is different depending on the version, so it is “fixed” only within one version. However, the version is always stored as the last four-byte integer in the file.
| Version 1 | Version 2 |
|---|---|
| File info offset (long) | File info offset (long) |
| Data index offset (long) | loadOnOpenOffset (long): the offset of the section that we need to load when opening the file |
| Number of data index entries (int) | Number of data index entries (int) |
| metaIndexOffset (long): this field is not used by the version 1 reader, so it was removed from version 2 | uncompressedDataIndexSize (long): the total uncompressed size of the whole data block index, including root-level, intermediate-level, and leaf-level blocks |
| Number of meta index entries (int) | Number of meta index entries (int) |
| Total uncompressed bytes (long) | Total uncompressed bytes (long) |
| numEntries (int) | numEntries (long) |
| Compression codec: 0 = LZO, 1 = GZ, 2 = NONE (int) | Compression codec: 0 = LZO, 1 = GZ, 2 = NONE (int) |
| (not present) | The number of levels in the data block index (int) |
| (not present) | firstDataBlockOffset (long): the offset of the first data block; used when scanning |
| (not present) | lastDataBlockEnd (long): the offset of the first byte after the last key/value data block; we don't need to go beyond this offset when scanning |
| Version: 1 (int) | Version: 2 (int) |
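Because the version is always stored as the last four-byte integer in the file, a reader can decide which trailer layout to parse before reading anything else. A minimal sketch of that check (illustrative; the real reader works against the filesystem API rather than RandomAccessFile):

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: determine the HFile version from the last four bytes of the file.
public class HFileVersionProbe {
    static int readVersion(RandomAccessFile file) throws IOException {
        file.seek(file.length() - 4);   // the version is the last four-byte integer in the file
        return file.readInt();          // 1 for HFile version 1, 2 for HFile version 2
    }
}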