The Apache Book

Revision History
Revision 0.91.0-SNAPSHOT  
Adding first cuts at Configuration, Getting Started, Data Model
Revision 0.89.20100924 5 October 2010 stack
Initial layout

Abstract

This is the official book of Apache HBase, a distributed, versioned, column-oriented database built on top of Apache Hadoop and Apache ZooKeeper.


Table of Contents

1. HBase and MapReduce
The default HBase MapReduce Splitter
HBase Input MapReduce Example
Accessing Other HBase Tables in a MapReduce Job
Speculative Execution
2. HBase and Schema Design
Schema Creation
On the number of column families
Monotonically Increasing Row Keys/Timeseries Data
Try to minimize row and column sizes
Number of Versions
Immutability of Rowkeys
Supported Datatypes
Counters
In-Memory ColumnFamilies
Secondary Indexes and Alternate Query Paths
Filter Query
Periodic-Update Secondary Index
Dual-Write Secondary Index
Summary Tables
Coprocessor Secondary Index
3. Metrics
Metric Setup
RegionServer Metrics
hbase.regionserver.blockCacheCount
hbase.regionserver.blockCacheFree
hbase.regionserver.blockCacheHitRatio
hbase.regionserver.blockCacheSize
hbase.regionserver.compactionQueueSize
hbase.regionserver.fsReadLatency_avg_time
hbase.regionserver.fsReadLatency_num_ops
hbase.regionserver.fsSyncLatency_avg_time
hbase.regionserver.fsSyncLatency_num_ops
hbase.regionserver.fsWriteLatency_avg_time
hbase.regionserver.fsWriteLatency_num_ops
hbase.regionserver.memstoreSizeMB
hbase.regionserver.regions
hbase.regionserver.requests
hbase.regionserver.storeFileIndexSizeMB
hbase.regionserver.stores
hbase.regionserver.storeFiles
4. Cluster Replication
5. Data Model
Conceptual View
Physical View
Table
Row
Column Family
Cells
Versions
Versions and HBase Operations
Current Limitations
6. Architecture
Client
Connections
WriteBuffer and Batch Methods
Filters
Filter Language
Daemons
Master
RegionServer
Regions
Region Size
Region Splits
Region Load Balancer
Store
Block Cache
Write Ahead Log (WAL)
Purpose
WAL Flushing
WAL Splitting
7. Bloom Filters
Configurations
HColumnDescriptor option
io.hfile.bloom.enabled global kill switch
io.hfile.bloom.error.rate
io.hfile.bloom.max.fold
Bloom StoreFile footprint
BloomFilter in the StoreFile FileInfo data structure
BloomFilter entries in StoreFile metadata
A. Tools
HBase hbck
HFile Tool
WAL Tools
HLog tool
Compression Tool
Node Decommission
Rolling Restart
CopyTable
B. Compression In HBase
CompressionTest Tool
hbase.regionserver.codecs
LZO
GZIP
SNAPPY
C. FAQ
D. YCSB: The Yahoo! Cloud Serving Benchmark and HBase
E. HFile format version 2
Motivation
HFile format version 1 overview
Block index format in version 1
HBase file format with inline blocks (version 2)
Overview
Unified version 2 block format
Block index in version 2
Root block index format in version 2
Non-root block index format in version 2
Bloom filters in version 2
File Info format in versions 1 and 2
Fixed file trailer format differences between versions 1 and 2
Index

List of Tables

5.1. Table webtable
5.2. ColumnFamily anchor
5.3. ColumnFamily contents

Chapter 1. HBase and MapReduce

See HBase and MapReduce up in javadocs. Start there. Below is some additional help.

The default HBase MapReduce Splitter

When TableInputFormat is used to source an HBase table in a MapReduce job, its splitter will make a map task for each region of the table. Thus, if there are 100 regions in the table, there will be 100 map-tasks for the job - regardless of how many column families are selected in the Scan.

HBase Input MapReduce Example

To use HBase as a MapReduce source, the job would be configured via TableMapReduceUtil in the following manner...

Job job = ...;
Scan scan = new Scan();
scan.setCaching(500);  // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);
// Now set other scan attrs
...

TableMapReduceUtil.initTableMapperJob(
  tableName,   		// input HBase table name
  scan, 			// Scan instance to control CF and attribute selection
  MyMapper.class,		// mapper
  Text.class,		// reducer key
  LongWritable.class,	// reducer value
  job			// job instance
  );

...and the mapper instance would extend TableMapper...

public class MyMapper extends TableMapper<Text, LongWritable> {

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws InterruptedException, IOException {
    // process data for the row from the Result instance.
  }
}

Accessing Other HBase Tables in a MapReduce Job

Although the framework currently allows one HBase table as input to a MapReduce job, other HBase tables can be accessed as lookup tables, etc., in a MapReduce job by creating an HTable instance in the setup method of the Mapper.

public class MyMapper extends TableMapper<Text, LongWritable> {
  private HTable myOtherTable;

  @Override
  public void setup(Context context) throws IOException {
    myOtherTable = new HTable("myOtherTable");
  }
}

Speculative Execution

It is generally advisable to turn off speculative execution for MapReduce jobs that use HBase as a source. This can either be done on a per-Job basis through properties, or on the entire cluster. Especially for longer running jobs, speculative execution will create duplicate map-tasks which will double-write your data to HBase; this is probably not what you want.
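
For example, a minimal per-job sketch of disabling speculation (the property names are the classic Hadoop MapReduce ones; setting them cluster-wide would instead be done in mapred-site.xml):

Job job = ...;   // the Job that sources the HBase table
job.getConfiguration().setBoolean("mapred.map.tasks.speculative.execution", false);
job.getConfiguration().setBoolean("mapred.reduce.tasks.speculative.execution", false);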

Chapter 2. HBase and Schema Design

A good general introduction to the strengths and weaknesses of data modeling in the various non-RDBMS datastores is Ian Varley's master's thesis, No Relation: The Mixed Blessings of Non-Relational Databases. Recommended.

Schema Creation

HBase schemas can be created or updated with ??? or by using HBaseAdmin in the Java API.

Tables must be disabled when making ColumnFamily modifications, for example:

Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
String table = "myTable";

admin.disableTable(table);

HColumnDescriptor cf1 = ...;
admin.addColumn(table, cf1);      // adding new ColumnFamily
HColumnDescriptor cf2 = ...;
admin.modifyColumn(table, cf2);   // modifying existing ColumnFamily

admin.enableTable(table);
      

See ??? for more information about configuring client connections.

On the number of column families

HBase currently does not do well with anything above two or three column families, so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per-Region basis, so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small. Compaction is currently triggered by the total number of files under a column family; it is not size-based. When there are many column families, the flushing and compaction interaction can make for a lot of needless I/O (to be addressed by changing flushing and compaction to work on a per-column-family basis).

Try to make do with one column family if you can in your schemas. Only introduce a second or third column family in the case where data access is usually column-scoped; i.e., you query one column family or the other but usually not both at the same time.

Monotonically Increasing Row Keys/Timeseries Data

In the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly) there is an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table's regions (and thus, a single node), then moving on to the next region, etc. With monotonically increasing row-keys (i.e., using a timestamp), this will happen. See this comic by IKai Lan on why monotonically increasing row keys are problematic in BigTable-like datastores: monotonically increasing values are bad. The pile-up on a single region brought on by monotonically increasing keys can be mitigated by randomizing the input records so they are not in sorted order, but in general it's best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key.

If you do need to upload time series data into HBase, you should study OpenTSDB as a successful example. It has a page describing the schema it uses in HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the lead position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table.
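
As an illustration of the lead-position idea only (not OpenTSDB's actual key encoding), a row key can be composed with the metric identifier first and the timestamp second; the metric id and its encoding below are assumptions for the sketch (real designs typically use a fixed-width metric id):

byte[] metricId = Bytes.toBytes("sys.cpu.user");                // hypothetical metric identifier
long eventTime = System.currentTimeMillis();
byte[] rowKey = Bytes.add(metricId, Bytes.toBytes(eventTime));  // [metric_type][event_timestamp]
Put put = new Put(rowKey);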

Try to minimize row and column sizes

Or why are my storefile indices large?

In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it'll be accompanied by its row, column name, and timestamp - always. If your rows and column names are large, especially compared to the size of the cell value, then you may run up against some interesting scenarios. One such is the case described by Marc Limotte at the tail of HBASE-3551 (recommended!). Therein, the indices that are kept on HBase storefiles (the section called “StoreFile (HFile)”) to facilitate random access may end up occupying large chunks of the HBase-allotted RAM because the cell value coordinates are large. Marc in the above-cited comment suggests upping the block size so entries in the store file index happen at a larger interval, or modifying the table schema so it makes for smaller rows and column names. Compression will also make for larger indices. See the thread a question storefileIndexSize up on the user mailing list.

In summary, although verbose attribute names (e.g., "myImportantAttribute") are easier to read, you pay for the clarity in storage and increased I/O - use shorter attribute names and constants. Also, try to keep the row-keys as small as possible.

Number of Versions

The number of row versions to store is configured per column family via HColumnDescriptor. The default is 3. This is an important parameter because, as described in Chapter 5, Data Model, HBase does not overwrite row values, but rather stores different values per row by time (and qualifier). Excess versions are removed during major compactions. The number of versions may need to be increased or decreased depending on application needs.
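
For example, a sketch of raising the number of versions kept for a column family (the table and family names are arbitrary, and the table should be disabled first as shown in the section called "Schema Creation"):

HColumnDescriptor cf = new HColumnDescriptor("myColumnFamily");
cf.setMaxVersions(5);                 // keep up to 5 versions instead of the default 3
admin.modifyColumn("myTable", cf);    // admin is an HBaseAdmin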

Immutability of Rowkeys

Rowkeys cannot be changed. The only way they can be "changed" in a table is if the row is deleted and then re-inserted. This is a fairly common question on the HBase dist-list so it pays to get the rowkeys right the first time (and/or before you've inserted a lot of data).

Supported Datatypes

HBase supports a "bytes-in/bytes-out" interface via Put and Result, so anything that can be converted to an array of bytes can be stored as a value. Input could be strings, numbers, complex objects, or even images, as long as they can be rendered as bytes.

There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Chapter 5, Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.

Counters

One supported datatype that deserves special mention is "counters" (i.e., the ability to do atomic increments of numbers). See Increment in HTable.

Synchronization on counters is done on the RegionServer, not in the client.
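
A minimal sketch of an atomic increment (the row, family, and qualifier names are arbitrary; htable is an HTable as in the other examples):

long newValue = htable.incrementColumnValue(
    Bytes.toBytes("row1"),           // row
    Bytes.toBytes("cf"),             // column family
    Bytes.toBytes("pageViews"),      // qualifier holding the counter
    1L);                             // amount to add, applied atomically on the RegionServer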

In-Memory ColumnFamilies

ColumnFamilies can optionally be defined as in-memory. Data is still persisted to disk, just like any other ColumnFamily. In-memory blocks have the highest priority in the section called “Block Cache”, but it is not a guarantee that the entire table will be in memory.

See HColumnDescriptor for more information.
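
For example, a sketch of flagging a ColumnFamily as in-memory at table-creation time (names are arbitrary; admin is an HBaseAdmin):

HTableDescriptor desc = new HTableDescriptor("myTable");
HColumnDescriptor cf = new HColumnDescriptor("cf");
cf.setInMemory(true);      // blocks from this family get in-memory priority in the Block Cache
desc.addFamily(cf);
admin.createTable(desc);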

Secondary Indexes and Alternate Query Paths

This section could also be titled "what if my table rowkey looks like this but I also want to query my table like that." A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain time ranges. Thus, selecting by user is easy because it is in the lead position of the key, but time is not.

There is no single answer on the best way to handle this because it depends on...

  • Number of users
  • Data size and data arrival rate
  • Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs. pre-configured ranges)
  • Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an ad-hoc report, whereas it may be too long for others)

... and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution. Common techniques are described in the sub-sections below. This is a representative, but not exhaustive, list of approaches.

It should not be a surprise that secondary indexes require additional cluster space and processing. This is precisely what happens in an RDBMS, because the act of creating an alternate index requires both space and processing cycles to update. RDBMS products are more advanced in this regard, handling alternative index management out of the box. However, HBase scales better at larger data volumes, so this is a feature trade-off.

Pay attention to ??? when implementing any of these approaches.

Additionally, see the David Butler response in this dist-list thread HBase, mail # user - Stargate+hbase

Filter Query

Depending on the case, it may be appropriate to use the section called “Filters”. In this case, no secondary index is created. However, don't try a full-scan on a large table like this from an application (i.e., single-threaded client).

Periodic-Update Secondary Index

A secondary index could be created in another table which is periodically updated via a MapReduce job. The job could be executed intra-day, but depending on load-strategy it could still potentially be out of sync with the main data table.

See Chapter 1, HBase and MapReduce for more information.

Dual-Write Secondary Index

Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to the data table, write to the index table). If this approach is taken after a data table already exists, then the secondary index will need to be bootstrapped with a MapReduce job (see the section called “Periodic-Update Secondary Index”).
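
A rough sketch of the dual-write pattern (the tables, the inverted index key layout, and the column names are assumptions for illustration; note there is no transactional guarantee across the two puts):

// write to the data table, keyed user-timestamp
Put dataPut = new Put(Bytes.toBytes(userId + "-" + timestamp));
dataPut.add(Bytes.toBytes("cf"), Bytes.toBytes("event"), Bytes.toBytes(payload));
dataTable.put(dataPut);

// write to the index table, keyed timestamp-first so time-range scans are possible
Put indexPut = new Put(Bytes.add(Bytes.toBytes(timestamp), Bytes.toBytes(userId)));
indexPut.add(Bytes.toBytes("cf"), Bytes.toBytes("dataKey"), dataPut.getRow());
indexTable.put(indexPut);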

Summary Tables

Where time-ranges are very wide (e.g., year-long report) and where the data is voluminous, summary tables are a common approach. These would be generated with MapReduce jobs into another table.

See Chapter 1, HBase and MapReduce for more information.

Coprocessor Secondary Index

Coprocessors act like RDBMS triggers. These are currently on TRUNK.

Chapter 3. Metrics

Metric Setup

See Metrics for an introduction and how to enable Metrics emission.

RegionServer Metrics

hbase.regionserver.blockCacheCount

Block cache item count in memory. This is the number of blocks of storefiles (HFiles) in the cache.

hbase.regionserver.blockCacheFree

Block cache memory available (bytes).

hbase.regionserver.blockCacheHitRatio

Block cache hit ratio (0 to 100). TODO: describe the impact on this ratio of read requests that have cacheBlocks=false.

hbase.regionserver.blockCacheSize

Block cache size in memory (bytes). i.e., memory in use by the BlockCache

hbase.regionserver.compactionQueueSize

Size of the compaction queue. This is the number of stores in the region that have been targeted for compaction.

hbase.regionserver.fsReadLatency_avg_time

Filesystem read latency (ms). This is the average time to read from HDFS.

hbase.regionserver.fsReadLatency_num_ops

TODO

hbase.regionserver.fsSyncLatency_avg_time

Filesystem sync latency (ms)

hbase.regionserver.fsSyncLatency_num_ops

TODO

hbase.regionserver.fsWriteLatency_avg_time

Filesystem write latency (ms)

hbase.regionserver.fsWriteLatency_num_ops

TODO

hbase.regionserver.memstoreSizeMB

Sum of all the memstore sizes in this RegionServer (MB)

hbase.regionserver.regions

Number of regions served by the RegionServer

hbase.regionserver.requests

Total number of read and write requests. Requests correspond to RegionServer RPC calls, thus a single Get will result in 1 request, but a Scan with caching set to 1000 will result in 1 request for each 'next' call (i.e., not each row). A bulk-load request will constitute 1 request per HFile.

hbase.regionserver.storeFileIndexSizeMB

Sum of all the storefile index sizes in this RegionServer (MB)

hbase.regionserver.stores

Number of stores open on the RegionServer. A store corresponds to a column family. For example, if a table (which contains the column family) has 3 regions on a RegionServer, there will be 3 stores open for that column family.

hbase.regionserver.storeFiles

Number of store files open on the RegionServer. A store may have more than one storefile (HFile).

Chapter 4. Cluster Replication

See Cluster Replication.

Chapter 5. Data Model

In short, applications store data into an HBase table. Tables are made of rows and columns. All columns in HBase belong to a particular column family. Table cells -- the intersection of row and column coordinates -- are versioned. A cell’s content is an uninterpreted array of bytes.

Table row keys are also byte arrays so almost anything can serve as a row key from strings to binary representations of longs or even serialized data structures. Rows in HBase tables are sorted by row key. The sort is byte-ordered. All table accesses are via the table row key -- its primary key.

Conceptual View

The following example is a slightly modified form of the one on page 2 of the BigTable paper. There is a table called webtable that contains two column families named contents and anchor. In this example, anchor contains two columns (anchor:cnnsi.com, anchor:my.look.ca) and contents contains one column (contents:html).

Column Names

By convention, a column name is made of its column family prefix and a qualifier. For example, the column contents:html is of the column family contents. The colon character (:) delimits the column family from the column family qualifier.

Table 5.1. Table webtable

Row Key         Time Stamp   ColumnFamily contents          ColumnFamily anchor
"com.cnn.www"   t9                                          anchor:cnnsi.com = "CNN"
"com.cnn.www"   t8                                          anchor:my.look.ca = "CNN.com"
"com.cnn.www"   t6           contents:html = "<html>..."
"com.cnn.www"   t5           contents:html = "<html>..."
"com.cnn.www"   t3           contents:html = "<html>..."


Physical View

Although at a conceptual level tables may be viewed as a sparse set of rows, physically they are stored on a per-column-family basis. New columns (i.e., columnfamily:column) can be added to any column family without pre-announcing them.

Table 5.2. ColumnFamily anchor

Row Key         Time Stamp   ColumnFamily anchor
"com.cnn.www"   t9           anchor:cnnsi.com = "CNN"
"com.cnn.www"   t8           anchor:my.look.ca = "CNN.com"


Table 5.3. ColumnFamily contents

Row Key         Time Stamp   ColumnFamily contents
"com.cnn.www"   t6           contents:html = "<html>..."
"com.cnn.www"   t5           contents:html = "<html>..."
"com.cnn.www"   t3           contents:html = "<html>..."


It is important to note in the tables above that the empty cells shown in the conceptual view are not stored, since they need not be in a column-oriented storage format. Thus a request for the value of the contents:html column at time stamp t8 would return no value. Similarly, a request for an anchor:my.look.ca value at time stamp t9 would return no value. However, if no timestamp is supplied, the most recent value for a particular column would be returned, and it would also be the first one found since timestamps are stored in descending order. Thus a request for the values of all columns in the row com.cnn.www, if no timestamp is specified, would be: the value of contents:html from time stamp t6, the value of anchor:cnnsi.com from time stamp t9, and the value of anchor:my.look.ca from time stamp t8.

Table

Tables are declared up front at schema definition time.

Row

Row keys are uninterpreted bytes. Rows are lexicographically sorted with the lowest order appearing first in a table. The empty byte array is used to denote both the start and end of a table's namespace.

Column Family

Columns in HBase are grouped into column families. All column members of a column family have the same prefix. For example, the columns courses:history and courses:math are both members of the courses column family. The colon character (:) delimits the column family from the column family qualifier. The column family prefix must be composed of printable characters. The qualifying tail, the column family qualifier, can be made of any arbitrary bytes. Column families must be declared up front at schema definition time whereas columns do not need to be defined at schema time but can be conjured on the fly while the table is up and running.

Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.

Cells

A {row, column, version} tuple exactly specifies a cell in HBase. Cell content is uninterpreted bytes.

Versions

A {row, column, version} tuple exactly specifies a cell in HBase. It is possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.

While rows and column keys are expressed as bytes, the version is specified using a long integer. Typically this long contains time instances such as those returned by java.util.Date.getTime() or System.currentTimeMillis(), that is: the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC.

The HBase version dimension is stored in decreasing order, so that when reading from a store file, the most recent values are found first.

There is a lot of confusion over the semantics of cell versions in HBase. In particular, a couple of questions that often come up are:

  • If multiple writes to a cell have the same version, are all versions maintained or just the last?[1]

  • Is it OK to write cells in a non-increasing version order?[2]

Below we describe how the version dimension in HBase currently works[3].

Versions and HBase Operations

In this section we look at the behavior of the version dimension for each of the core HBase operations.

Get/Scan

Gets are implemented on top of Scans. The below discussion of Get applies equally to Scans.

By default, i.e. if you specify no explicit version, when doing a get, the cell whose version has the largest value is returned (which may or may not be the latest one written, see later). The default behavior can be modified in the following ways:

  • to return more than one version, see Get.setMaxVersions()

  • to return versions other than the latest, see Get.setTimeRange()

    To retrieve the latest version that is less than or equal to a given value, thus giving the 'latest' state of the record at a certain point in time, just use a range from 0 to the desired version and set the max versions to 1 (a sketch follows below).
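
    For example, a sketch of this technique, assuming a point in time T expressed in the same units as the cell versions (note that the upper bound of setTimeRange is exclusive):

        long T = ...;                 // the desired point in time
        Get get = new Get(Bytes.toBytes("row1"));
        get.setTimeRange(0, T + 1);   // consider only versions <= T
        get.setMaxVersions(1);        // return just the newest qualifying version
        Result r = htable.get(get);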

Default Get Example

The following Get will only retrieve the current version of the row.

        Get get = new Get(Bytes.toBytes("row1"));
        Result r = htable.get(get);
        byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns current version of value          

Versioned Get Example

The following Get will return the last 3 versions of the row.

        Get get = new Get(Bytes.toBytes("row1"));
        get.setMaxVersions(3);  // will return last 3 versions of row
        Result r = htable.get(get);
        byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns current version of value
        List<KeyValue> kv = r.getColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns all versions of this column
        

Put

Doing a put always creates a new version of a cell, at a certain timestamp. By default the system uses the server's currentTimeMillis, but you can specify the version (= the long integer) yourself, on a per-column level. This means you could assign a time in the past or the future, or use the long value for non-time purposes.

To overwrite an existing value, do a put at exactly the same row, column, and version as that of the cell you would overshadow.

Implicit Version Example

The following Put will be implicitly versioned by HBase with the current time.

          Put put = new Put(Bytes.toBytes(row));
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes( data));
          htable.put(put);
          

Explicit Version Example

The following Put has the version timestamp explicitly set.

          Put put = new Put( Bytes.toBytes(row ));
          long explicitTimeInMs = 555;  // just an example
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), explicitTimeInMs, Bytes.toBytes(data));
          htable.put(put);
          

Delete

When performing a delete operation in HBase, there are two ways to specify the versions to be deleted:

  • Delete all versions older than a certain timestamp

  • Delete the version at a specific timestamp

A delete can apply to a complete row, a complete column family, or to just one column. It is only in the last case that you can delete explicit versions. For the deletion of a row or all the columns within a family, it always works by deleting all cells older than a certain version.
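
As a sketch of these cases (row, family, and qualifier names are arbitrary; htable as in the other examples):

Delete delete = new Delete(Bytes.toBytes("row1"));
delete.deleteColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), 123456789L);   // only the cell at exactly this version
delete.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("attr2"), 123456789L);  // all versions <= this timestamp
delete.deleteFamily(Bytes.toBytes("cf2"));   // every cell in this family older than the delete's timestamp
htable.delete(delete);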

Deletes work by creating tombstone markers. For example, let's suppose we want to delete a row. For this you can specify a version, or else by default the currentTimeMillis is used. What this means is delete all cells where the version is less than or equal to this version. HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition. Rather, a so-called tombstone is written, which will mask the deleted values[4]. If the version you specified when deleting a row is larger than the version of any value in the row, then you can consider the complete row to be deleted.

Current Limitations

There are still some bugs (or at least 'undecided behavior') with the version dimension that will be addressed by later HBase releases.

Deletes mask Puts

Deletes mask puts, even puts that happened after the delete was entered[5]. Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you do a delete of everything <= T. After this you do a new put with a timestamp <= T. This put, even if it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put had no effect. It will start working again after the major compaction has run. These issues should not be a problem if you use always-increasing versions for new puts to a row. But they can occur even if you do not care about time: just do a delete and put immediately after each other, and there is some chance they happen within the same millisecond.

Major compactions change query results

...create three cell versions at t1, t2 and t3, with a maximum-versions setting of 2. So when getting all versions, only the values at t2 and t3 will be returned. But if you delete the version at t2 or t3, the one at t1 will appear again. Obviously, once a major compaction has run, such behavior will not be the case anymore...[6]



[1] Currently, only the last written is fetchable.

[2] Yes

[3] See HBASE-2406 for discussion of HBase versions. Bending time in HBase makes for a good read on the version, or time, dimension in HBase. It has more detail on versioning than is provided here. As of this writing, the limitation Overwriting values at existing timestamps mentioned in the article no longer holds in HBase. This section is basically a synopsis of this article by Bruno Dumon.

[4] When HBase does a major compaction, the tombstones are processed to actually remove the dead values, together with the tombstones themselves.

[6] See Garbage Collection in Bending time in HBase

Chapter 6. Architecture

Client

The HBase client HTable is responsible for finding RegionServers that are serving the particular row range of interest. It does this by querying the .META. and -ROOT- catalog tables (TODO: Explain). After locating the required region(s), the client directly contacts the RegionServer serving that region (i.e., it does not go through the master) and issues the read or write request. This information is cached in the client so that subsequent requests need not go through the lookup process. Should a region be reassigned either by the master load balancer or because a RegionServer has died, the client will requery the catalog tables to determine the new location of the user region.

Administrative functions are handled through HBaseAdmin.

Connections

For connection configuration information, see ???.

HTable instances are not thread-safe. When creating HTable instances, it is advisable to use the same HBaseConfiguration instance. This will ensure sharing of ZooKeeper and socket connections to the RegionServers, which is usually what you want. For example, this is preferred:

HBaseConfiguration conf = HBaseConfiguration.create();
HTable table1 = new HTable(conf, "myTable");
HTable table2 = new HTable(conf, "myTable");

as opposed to this:

HBaseConfiguration conf1 = HBaseConfiguration.create();
HTable table1 = new HTable(conf1, "myTable");
HBaseConfiguration conf2 = HBaseConfiguration.create();
HTable table2 = new HTable(conf2, "myTable");

For more information about how connections are handled in the HBase client, see HConnectionManager.

Connection Pooling

For applications which require high-end multithreaded access (e.g., web-servers or application servers that may serve many application threads in a single JVM), see HTablePool.
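
A sketch of basic pool usage (the pool size and table name are arbitrary; conf is a configuration created as in the Connections examples above):

HTablePool pool = new HTablePool(conf, 10);        // share up to 10 HTable instances per table
HTableInterface table = pool.getTable("myTable");
try {
  Put put = new Put(Bytes.toBytes("row1"));
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes("value"));
  table.put(put);
} finally {
  pool.putTable(table);                            // return the instance to the pool
}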

WriteBuffer and Batch Methods

If ??? is turned off on HTable, Puts are sent to RegionServers when the writebuffer is filled. The writebuffer is 2MB by default. Before an HTable instance is discarded, either close() or flushCommits() should be invoked so Puts will not be lost.

Note: htable.delete(Delete); does not go in the writebuffer! This only applies to Puts.
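
A sketch of buffered writes (the buffer size shown is also the default; table and column names are arbitrary; conf as in the Connections examples above):

HTable htable = new HTable(conf, "myTable");
htable.setAutoFlush(false);                    // buffer Puts client-side rather than one RPC per Put
htable.setWriteBufferSize(1024 * 1024 * 2);    // 2MB
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes("value"));
htable.put(put);                               // queued in the write buffer
htable.flushCommits();                         // or htable.close(); either flushes outstanding Puts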

For additional information on write durability, review the ACID semantics page.

For fine-grained control of batching of Puts or Deletes, see the batch methods on HTable.
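
For example, a sketch of batching heterogeneous operations into one client call (row and column names are arbitrary; htable as above):

List<Row> actions = new ArrayList<Row>();   // Row is implemented by Put and Delete
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes("value"));
actions.add(put);
actions.add(new Delete(Bytes.toBytes("row2")));
Object[] results = new Object[actions.size()];
htable.batch(actions, results);             // results[i] holds the outcome of actions[i]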

Filters

Get and Scan instances can be optionally configured with filters which are applied on the RegionServer.
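
For example, a sketch of attaching a filter to a Scan so only matching cells come back from the RegionServer (family, qualifier, and value are arbitrary):

Scan scan = new Scan();
scan.setFilter(new SingleColumnValueFilter(
    Bytes.toBytes("cf"),                  // column family
    Bytes.toBytes("attr"),                // qualifier
    CompareFilter.CompareOp.EQUAL,        // compare operator
    Bytes.toBytes("value")));             // value to match; evaluated server-side
ResultScanner scanner = htable.getScanner(scan);
for (Result r : scanner) {
  // process matching rows
}
scanner.close();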

Filter Language

Use Case

This allows the user to perform server-side filtering when accessing HBase over Thrift. The user specifies a filter via a string. The string is parsed on the server to construct the filter.

General Filter String Syntax

A simple filter expression is expressed as: “FilterName (argument, argument, ... , argument)”

You must specify the name of the filter followed by the argument list in parentheses. Commas separate the individual arguments.

If the argument represents a string, it should be enclosed in single quotes.

If it represents a boolean, an integer, or a comparison operator like <, >, !=, etc., it should not be enclosed in quotes.

The filter name must be one word. All ASCII characters are allowed except for whitespace, single quotes and parentheses.

The filter's arguments can contain any ASCII character. If single quotes are present in the argument, they must be escaped by a preceding single quote.

Compound Filters and Operators

Currently, two binary operators – AND/OR and two unary operators – WHILE/SKIP are supported.

Note: the operators are all in uppercase

AND – as the name suggests, if this operator is used, the key-value must pass both the filters

OR – as the name suggests, if this operator is used, the key-value must pass at least one of the filters

SKIP – For a particular row, if any of the key-values don’t pass the filter condition, the entire row is skipped

WHILE - For a particular row, it continues to emit key-values until a key-value is reached that fails the filter condition

Compound Filters: Using these operators, a hierarchy of filters can be created. For example: “(Filter1 AND Filter2) OR (Filter3 AND Filter4)”

Order of Evaluation

Parentheses have the highest precedence. The SKIP and WHILE operators are next and have the same precedence. The AND operator has the next-highest precedence, followed by the OR operator.

For example:

A filter string of the form: “Filter1 AND Filter2 OR Filter3” will be evaluated as: “(Filter1 AND Filter2) OR Filter3”

A filter string of the form: “Filter1 AND SKIP Filter2 OR Filter3” will be evaluated as: “(Filter1 AND (SKIP Filter2)) OR Filter3”

Compare Operator

A compare operator can be any of the following:

  1. LESS (<)

  2. LESS_OR_EQUAL (<=)

  3. EQUAL (=)

  4. NOT_EQUAL (!=)

  5. GREATER_OR_EQUAL (>=)

  6. GREATER (>)

  7. NO_OP (no operation)

The client should use the symbols (<, <=, =, !=, >, >=) to express compare operators.

Comparator

A comparator can be any of the following:

  1. BinaryComparator - This lexicographically compares against the specified byte array using Bytes.compareTo(byte[], byte[])

  2. BinaryPrefixComparator - This lexicographically compares against a specified byte array. It only compares up to the length of this byte array.

  3. RegexStringComparator - This compares against the specified byte array using the given regular expression. Only EQUAL and NOT_EQUAL comparisons are valid with this comparator

  4. SubStringComparator - This tests if the given substring appears in a specified byte array. The comparison is case insensitive. Only EQUAL and NOT_EQUAL comparisons are valid with this comparator

The general syntax of a comparator is: ComparatorType:ComparatorValue

The ComparatorType for the various comparators is as follows:

  1. BinaryComparator - binary

  2. BinaryPrefixComparator - binaryprefix

  3. RegexStringComparator - regexstring

  4. SubStringComparator - substring

The ComparatorValue can be any value.

Example1: >, 'binary:abc' will match everything that is lexicographically greater than "abc"

Example2: =, 'binaryprefix:abc' will match everything whose first 3 characters are lexicographically equal to "abc"

Example3: !=, 'regexstring:ab*yz' will match everything that doesn't begin with "ab" and ends with "yz"

Example4: =, 'substring:abc123' will match everything that begins with the substring "abc123"

Example PHP Client Program that uses the Filter Language

<? $_SERVER['PHP_ROOT'] = realpath(dirname(__FILE__).'/..');
   require_once $_SERVER['PHP_ROOT'].'/flib/__flib.php';
   flib_init(FLIB_CONTEXT_SCRIPT);
   require_module('storage/hbase');
   $hbase = new HBase('<server_name_running_thrift_server>', <port on which thrift server is running>);
   $hbase->open();
   $client = $hbase->getClient();
   $result = $client->scannerOpenWithFilterString('table_name', "(PrefixFilter ('row2') AND (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter ( 123, 456))");
   $to_print = $client->scannerGetList($result,1);
   while ($to_print) {
      print_r($to_print);
      $to_print = $client->scannerGetList($result,1);
    }
   $client->scannerClose($result);
?>
        

Example Filter Strings

  • “PrefixFilter (‘Row’) AND PageFilter (1) AND FirstKeyOnlyFilter ()” will return all key-value pairs that match the following conditions:

    1) The row containing the key-value should have prefix “Row”

    2) The key-value must be located in the first row of the table

    3) The key-value pair must be the first key-value in the row

  • “(RowFilter (=, ‘binary:Row 1’) AND TimeStampsFilter (74689, 89734)) OR ColumnRangeFilter (‘abc’, true, ‘xyz’, false))” will return all key-value pairs that match both the following conditions:

    1) The key-value is in a row having row key “Row 1”

    2) The key-value must have a timestamp of either 74689 or 89734.

    Or it must match the following condition:

    1) The key-value pair must be in a column that is lexicographically >= abc and < xyz 

    • “SKIP ValueFilter (0)” will skip the entire row if any of the values in the row is not 0

    Individual Filter Syntax

    1. KeyOnlyFilter

      Description: This filter doesn’t take any arguments. It returns only the key component of each key-value.

      Syntax: KeyOnlyFilter ()

      Example: "KeyOnlyFilter ()"

    2. FirstKeyOnlyFilter

      Description: This filter doesn’t take any arguments. It returns only the first key-value from each row.

      Syntax: FirstKeyOnlyFilter ()

      Example: "FirstKeyOnlyFilter ()"

    3. PrefixFilter

      Description: This filter takes one argument – a prefix of a row key. It returns only those key-values present in a row that starts with the specified row prefix

      Syntax: PrefixFilter (‘<row_prefix>’)

      Example: "PrefixFilter (‘Row’)"

    4. ColumnPrefixFilter

      Description: This filter takes one argument – a column prefix. It returns only those key-values present in a column that starts with the specified column prefix. The column prefix must be of the form: “qualifier”

      Syntax:ColumnPrefixFilter(‘<column_prefix>’)

      Example: "ColumnPrefixFilter(‘Col’)"

    5. MultipleColumnPrefixFilter

      Description: This filter takes a list of column prefixes. It returns key-values that are present in a column that starts with any of the specified column prefixes. Each of the column prefixes must be of the form: “qualifier”

      Syntax:MultipleColumnPrefixFilter(‘<column_prefix>’, ‘<column_prefix>’, …, ‘<column_prefix>’)

      Example: "MultipleColumnPrefixFilter(‘Col1’, ‘Col2’)"

    6. ColumnCountGetFilter

      Description: This filter takes one argument – a limit. It returns the first limit number of columns in the table

      Syntax: ColumnCountGetFilter (‘<limit>’)

      Example: "ColumnCountGetFilter (4)"

    7. PageFilter

      Description: This filter takes one argument – a page size. It returns page size number of rows from the table.

      Syntax: PageFilter (‘<page_size>’)

      Example: "PageFilter (2)"

    8. ColumnPaginationFilter

      Description: This filter takes two arguments – a limit and offset. It returns limit number of columns after offset number of columns. It does this for all the rows

      Syntax: ColumnPaginationFilter(‘<limit>’, ‘<offset>’)

      Example: "ColumnPaginationFilter (3, 5)"

    9. InclusiveStopFilter

      Description: This filter takes one argument – a row key on which to stop scanning. It returns all key-values present in rows up to and including the specified row

      Syntax: InclusiveStopFilter(‘<stop_row_key>’)

      Example: "InclusiveStopFilter (‘Row2’)”

    10. TimeStampsFilter

      Description: This filter takes a list of timestamps. It returns those key-values whose timestamp matches any of the specified timestamps

      Syntax: TimeStampsFilter (<timestamp>, <timestamp>, ... ,<timestamp>)

      Example: "TimeStampsFilter (5985489, 48895495, 58489845945)"

    11. RowFilter

      Description: This filter takes a compare operator and a comparator. It compares each row key with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that row

      Syntax: RowFilter (<compareOp>, ‘<row_comparator>’)

      Example: "RowFilter (<=, ‘xyz)"

    12. FamilyFilter

      Description: This filter takes a compare operator and a comparator. It compares each column family name with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that column family

      Syntax: FamilyFilter (<compareOp>, ‘<family_comparator>’)

      Example: "FamilyFilter (=, ‘FamilyA’)"

    13. QualifierFilter

      Description: This filter takes a compare operator and a comparator. It compares each qualifier name with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that column

      Syntax: QualifierFilter (<compareOp>,‘<qualifier_comparator>’)

      Example: "QualifierFilter (=,‘Column1’)"

    14. ValueFilter

      Description: This filter takes a compare operator and a comparator. It compares each value with the comparator using the compare operator and if the comparison returns true, it returns that key-value

      Syntax: ValueFilter (<compareOp>,‘<value_comparator>’)

      Example: "ValueFilter (!=, ‘Value’)"

    15. DependentColumnFilter

      Description: This filter takes two arguments – a family and a qualifier. It tries to locate this column in each row and returns all key-values in that row that have the same timestamp. If the row doesn’t contain the specified column – none of the key-values in that row will be returned.

      The filter can also take an optional boolean argument – dropDependentColumn. If set to true, the column we were depending on doesn’t get returned.

      The filter can also take two more additional optional arguments – a compare operator and a value comparator, which are further checks in addition to the family and qualifier. If the dependent column is found, its value should also pass the value check and then only is its timestamp taken into consideration

      Syntax: DependentColumnFilter (‘<family>’, ‘<qualifier>’, <boolean>, <compare operator>, ‘<value_comparator>’)

      Syntax: DependentColumnFilter (‘<family>’, ‘<qualifier>’, <boolean>)

      Syntax: DependentColumnFilter (‘<family>’, ‘<qualifier>’)

      Example: "DependentColumnFilter (‘conf’, ‘blacklist’, false, >=, ‘zebra’)"

      Example: "DependentColumnFilter (‘conf’, 'blacklist', true)"

      Example: "DependentColumnFilter (‘conf’, 'blacklist')"

    16. SingleColumnValueFilter

      Description: This filter takes a column family, a qualifier, a compare operator and a comparator. If the specified column is not found – all the columns of that row will be emitted. If the column is found and the comparison with the comparator returns true, all the columns of the row will be emitted. If the condition fails, the row will not be emitted.

      This filter also takes two additional optional boolean arguments – filterIfColumnMissing and setLatestVersionOnly

      If the filterIfColumnMissing flag is set to true the columns of the row will not be emitted if the specified column to check is not found in the row. The default value is false.

      If the setLatestVersionOnly flag is set to false, it will test previous versions (timestamps) too. The default value is true.

      These flags are optional; you must set either both of them or neither.

      Syntax: SingleColumnValueFilter(<compare operator>, ‘<comparator>’, ‘<family>’, ‘<qualifier>’,<filterIfColumnMissing_boolean>, <latest_version_boolean>)

      Syntax: SingleColumnValueFilter(<compare operator>, ‘<comparator>’, ‘<family>’, ‘<qualifier>’)

      Example: "SingleColumnValueFilter (<=, ‘abc’,‘FamilyA’, ‘Column1’, true, false)"

      Example: "SingleColumnValueFilter (<=, ‘abc’,‘FamilyA’, ‘Column1’)"

    17. SingleColumnValueExcludeFilter

      Description: This filter takes the same arguments and behaves the same as SingleColumnValueFilter – however, if the column is found and the condition passes, all the columns of the row will be emitted except for the tested column value.

      Syntax: SingleColumnValueExcludeFilter(<compare operator>, '<comparator>', '<family>', '<qualifier>',<latest_version_boolean>, <filterIfColumnMissing_boolean>)

      Syntax: SingleColumnValueExcludeFilter(<compare operator>, '<comparator>', '<family>', '<qualifier>')

      Example: "SingleColumnValueExcludeFilter (‘<=’, ‘abc’,‘FamilyA’, ‘Column1’, ‘false’, ‘true’)"

      Example: "SingleColumnValueExcludeFilter (‘<=’, ‘abc’, ‘FamilyA’, ‘Column1’)"

    18. ColumnRangeFilter

      Description: This filter is used for selecting only those keys with columns that are between minColumn and maxColumn. It also takes two boolean variables to indicate whether to include the minColumn and maxColumn or not.

      If you don’t want to set the minColumn or the maxColumn – you can pass in an empty argument.

      Syntax: ColumnRangeFilter (‘<minColumn>’, <minColumnInclusive_bool>, ‘<maxColumn>’, <maxColumnInclusive_bool>)

      Example: "ColumnRangeFilter (‘abc’, true, ‘xyz’, false)"

    Daemons

    Master

    HMaster is the implementation of the Master Server. The Master server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes.

    Startup Behavior

    If run in a multi-Master environment, all Masters compete to run the cluster. If the active Master loses its lease in ZooKeeper (or the Master shuts down), then the remaining Masters jostle to take over the Master role.

    Interface

    The methods exposed by HMasterInterface are primarily metadata-oriented methods:

    • Table (createTable, modifyTable, removeTable, enable, disable)
    • ColumnFamily (addColumn, modifyColumn, removeColumn)
    • Region (move, assign, unassign)

    For example, when the HBaseAdmin method disableTable is invoked, it is serviced by the Master server.

    Processes

    The Master runs several background threads:

    • LoadBalancer periodically reassigns regions in the cluster.
    • CatalogJanitor periodically checks and cleans up the .META. table.

    RegionServer

    HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions.

    Interface

    The methods exposed by HRegionInterface contain both data-oriented and region-maintenance methods:

    • Data (get, put, delete, next, etc.)
    • Region (splitRegion, compactRegion, etc.)

    For example, when the HBaseAdmin method majorCompact is invoked on a table, the client is actually iterating through all regions for the specified table and requesting a major compaction directly to each region.

    Processes

    The RegionServer runs a variety of background threads:

    • CompactSplitThread checks for splits and handles minor compactions.
    • MajorCompactionChecker checks for major compactions.
    • MemStoreFlusher periodically flushes in-memory writes in the MemStore to StoreFiles.
    • LogRoller periodically checks the RegionServer's HLog.

    Regions

    This section is all about Regions.

    Note

    Regions are comprised of a Store per Column Family.

    Region Size

    Region size is one of those tricky things; there are a few factors to consider:

    • Regions are the basic element of availability and distribution.

    • HBase scales by having regions across many servers. Thus if you have 2 regions for 16GB of data on a 20-node cluster, most of the cluster will sit idle.

    • High region count has been known to make things slow. This is getting better, but it is probably better to have 700 regions than 3000 for the same amount of data.

    • Low region count prevents parallel scalability as per point #2. This really can't be stressed enough, since a common problem is loading 200MB of data into HBase and then wondering why your awesome 10-node cluster is mostly idle.

    • There is not much memory footprint difference between 1 region and 10 in terms of indexes, etc., held by the RegionServer.

    It's probably best to stick to the default, perhaps going smaller for hot tables (or manually splitting hot regions to spread the load over the cluster), or going with a 1GB region size if your cell sizes tend to be largish (100k and up).

    Region Splits

    Splits run unaided on the RegionServer; i.e., the Master does not participate. The RegionServer splits a region, offlines the split region, adds the daughter regions to META, opens the daughters on the parent's hosting RegionServer, and then reports the split to the Master. See ??? for how to manually manage splits (and for why you might do this).

    Region Load Balancer

    Periodically, and when there are no regions in transition, a load balancer will run and move regions around to balance the cluster's load. The period at which it runs can be configured.

    Store

    A Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region.

    MemStore

    The MemStore holds in-memory modifications to the Store. Modifications are KeyValues. When asked to flush, the current MemStore is moved to a snapshot and is cleared. HBase continues to serve edits out of the new MemStore and the backing snapshot until the flusher reports that the flush succeeded. At that point the snapshot is discarded.

    StoreFile (HFile)

    HFile Format

    The hfile file format is based on the SSTable file described in the BigTable [2006] paper and on Hadoop's tfile (The unit test suite and the compression harness were taken directly from tfile). Schubert Zhang's blog post on HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs makes for a thorough introduction to HBase's hfile. Matteo Bertozzi has also put up a helpful description, HBase I/O: HFile.

    HFile Tool

    To view a textualized version of hfile content, you can use the org.apache.hadoop.hbase.io.hfile.HFile tool. Type the following to see usage:

    $ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile  

    For example, to view the content of the file hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475, type the following:

     $ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -v -f hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475  

    Leave off the -v option to see just a summary of the hfile. See the usage output for other things the HFile tool can do.

    Compaction

    There are two types of compactions: minor and major. Minor compactions will usually pick up a couple of the smaller adjacent files and rewrite them as one. Minor compactions do not drop deletes or expired cells; only major compactions do this. Sometimes a minor compaction will pick up all the files in the store, in which case it actually promotes itself to being a major compaction. For a description of how a minor compaction picks files to compact, see the ascii diagram in the Store source code.

    After a major compaction runs there will be a single storefile per store, and this will usually help performance. Caution: major compactions rewrite all of the store's data and, on a loaded system, this may not be tenable; major compactions will usually have to be done manually on large systems. See ???.

    Block Cache

    The Block Cache contains three levels of block priority to allow for scan-resistance and in-memory ColumnFamilies. A block is added with an in-memory flag if the containing ColumnFamily is defined in-memory, otherwise a block becomes a single access priority. Once a block is accessed again, it changes to multiple access. This is used to prevent scans from thrashing the cache, adding a least-frequently-used element to the eviction algorithm. Blocks from in-memory ColumnFamilies are the last to be evicted.

    For more information, see the LruBlockCache source

    Write Ahead Log (WAL)

    Purpose

    Each RegionServer adds updates (Puts, Deletes) to its write-ahead log (WAL) first, and then to the MemStore for the affected Store (see the sections called “MemStore” and “Store”). This ensures that HBase has durable writes. Without the WAL, there is the possibility of data loss in the case of a RegionServer failure before each MemStore is flushed and new StoreFiles are written. HLog is the HBase WAL implementation, and there is one HLog instance per RegionServer.

    The WAL is in HDFS in /hbase/.logs/ with subdirectories per RegionServer.

    For more general information about the concept of write ahead logs, see the Wikipedia Write-Ahead Log article.

    WAL Flushing

    TODO (describe).

    WAL Splitting

    How edits are recovered from a crashed RegionServer

    When a RegionServer crashes, it will lose its ephemeral lease in ZooKeeper...TODO

    hbase.hlog.split.skip.errors

    When set to true, the default, any error encountered splitting will be logged, the problematic WAL will be moved into the .corrupt directory under the hbase rootdir, and processing will continue. If set to false, the exception will be propagated and the split logged as failed.[7]

    How EOFExceptions are treated when splitting a crashed RegionServers' WALs

    If we get an EOF while splitting logs, we proceed with the split even when hbase.hlog.split.skip.errors == false. An EOF while reading the last log in the set of files to split is near-guaranteed since the RegionServer likely crashed mid-write of a record. But we'll continue even if we got an EOF reading other than the last file in the set.[8]


    Chapter 7. Bloom Filters

    Bloom filters were developed over in HBase-1200 Add bloomfilters.[9][10]

    Configurations

    Blooms are enabled by specifying options on a column family in the HBase shell or in Java code via org.apache.hadoop.hbase.HColumnDescriptor.

    HColumnDescriptor option

    Use HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) to enable blooms per Column Family. Default = NONE for no bloom filters. If ROW, the hash of the row will be added to the bloom on each insert. If ROWCOL, the hash of the row + column family + column family qualifier will be added to the bloom on each key insert.
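
    For example, a sketch in Java (the bloom type enum is assumed to live at org.apache.hadoop.hbase.regionserver.StoreFile.BloomType; table and family names are arbitrary, and the table should be disabled first):

    HColumnDescriptor cf = new HColumnDescriptor("cf");
    cf.setBloomFilterType(StoreFile.BloomType.ROW);   // add the row hash to the bloom on each insert
    admin.modifyColumn("myTable", cf);                // admin is an HBaseAdmin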

    io.hfile.bloom.enabled global kill switch

    io.hfile.bloom.enabled in Configuration serves as the kill switch in case something goes wrong. Default = true.

    io.hfile.bloom.error.rate

    io.hfile.bloom.error.rate = average false positive rate. Default = 1%. Decrease rate by ½ (e.g. to .5%) == +1 bit per bloom entry.

    io.hfile.bloom.max.fold

    io.hfile.bloom.max.fold = guaranteed minimum fold rate. Most people should leave this alone. Default = 7, or can collapse to at least 1/128th of original size. See the Development Process section of the document BloomFilters in HBase for more on what this option means.

    Bloom StoreFile footprint

    Bloom filters add an entry to the StoreFile general FileInfo data structure and then two extra entries to the StoreFile metadata section.

    BloomFilter in the StoreFile FileInfo data structure

    BLOOM_FILTER_TYPE

    FileInfo has a BLOOM_FILTER_TYPE entry which is set to NONE, ROW or ROWCOL.

    BloomFilter entries in StoreFile metadata

    BLOOM_FILTER_META

    BLOOM_FILTER_META holds Bloom Size, Hash Function used, etc. It is small in size and is cached on StoreFile.Reader load.

    BLOOM_FILTER_DATA

    BLOOM_FILTER_DATA is the actual bloomfilter data. Obtained on-demand. Stored in the LRU cache, if it is enabled (it is enabled by default).



    [9] For description of the development process -- why static blooms rather than dynamic -- and for an overview of the unique properties that pertain to blooms in HBase, as well as possible future directions, see the Development Process section of the document BloomFilters in HBase attached to HBase-1200.

    [10] The bloom filters described here are actually version two of blooms in HBase. In versions up to 0.19.x, HBase had a dynamic bloom option based on work done by the European Commission One-Lab Project 034819. The core of the HBase bloom work was later pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile. Version 1 of HBase blooms never worked that well. Version 2 is a rewrite from scratch though again it starts with the one-lab work.


    Appendix A. Tools

    Here we list HBase tools for administration, analysis, fixup, and debugging.

    HBase hbck

    An fsck for your HBase install

    To run hbck against your HBase cluster, run:

    $ ./bin/hbase hbck

    At the end of the command's output it prints OK or INCONSISTENCY. If your cluster reports inconsistencies, pass -details to see more detail emitted. If there are inconsistencies, run hbck a few times because the inconsistency may be transient (e.g. the cluster is starting up or a region is splitting). Passing -fix may correct the inconsistency (this latter is an experimental feature).

    WAL Tools

    HLog tool

    The main method on HLog offers manual split and dump facilities. Pass it WALs or the product of a split, the content of the recovered.edits directory.

    You can get a textual dump of a WAL file content by doing the following:

     $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --dump hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012 

    The return code will be non-zero if there are issues with the file, so you can test the wholesomeness of a file by redirecting STDOUT to /dev/null and testing the program's return code.

    Similarly you can force a split of a log file directory by doing:

     $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --split hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/

    Node Decommission

    You can stop an individual RegionServer by running the following script in the HBase directory on the particular node:

    $ ./bin/hbase-daemon.sh stop regionserver

    The RegionServer will first close all regions and then shut itself down. On shutdown, the RegionServer's ephemeral node in ZooKeeper will expire. The master will notice the RegionServer gone and will treat it as a 'crashed' server; it will reassign the regions the RegionServer was carrying.

    Disable the Load Balancer before Decommissioning a node

    If the load balancer runs while a node is shutting down, then there could be contention between the Load Balancer and the Master's recovery of the just decommissioned RegionServer. Avoid any problems by disabling the balancer first. See Load Balancer below.

    A downside to the above stop of a RegionServer is that regions could be offline for a good period of time. Regions are closed in order. If there are many regions on the server, the first region to close may not be back online until all regions close and the master notices the RegionServer's znode is gone. HBase 0.90.2 added a facility for having a node gradually shed its load and then shut itself down: the graceful_stop.sh script. Here is its usage:

    $ ./bin/graceful_stop.sh
    Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload] [--thrift] [--rest] <hostname>
     thrift      If we should stop/start thrift before/after the hbase stop/start
     rest        If we should stop/start rest before/after the hbase stop/start
     restart     If we should restart after graceful stop
     reload      Move offloaded regions back on to the stopped server
     debug       Move offloaded regions back on to the stopped server
     hostname    Hostname of server we are to stop

    To decommission a loaded RegionServer, run the following:

    $ ./bin/graceful_stop.sh HOSTNAME

    where HOSTNAME is the host carrying the RegionServer you would decommission.

    On HOSTNAME

    The HOSTNAME passed to graceful_stop.sh must match the hostname that HBase is using to identify RegionServers. Check the list of RegionServers in the master UI for how HBase is referring to servers. It is usually the hostname but can also be the FQDN. Whatever HBase is using, this is what you should pass to the graceful_stop.sh decommission script. If you pass IPs, the script is not yet smart enough to make a hostname (or FQDN) of it, so it will fail when it checks whether the server is currently running; the graceful unloading of regions will not run.

    The graceful_stop.sh script will move the regions off the decommissioned RegionServer one at a time to minimize region churn. It will verify that the region is deployed in the new location before it moves the next region, and so on until the decommissioned server is carrying zero regions. At this point, graceful_stop.sh tells the RegionServer to stop. The master will notice the RegionServer gone, but all regions will have already been redeployed, and because the RegionServer went down cleanly, there will be no WAL logs to split.

    Load Balancer

    It is assumed that the Region Load Balancer is disabled while the graceful_stop script runs (otherwise the balancer and the decommission script will end up fighting over region deployments). Use the shell to disable the balancer:

    hbase(main):001:0> balance_switch false
    true
    0 row(s) in 0.3590 seconds

    This turns the balancer OFF. To reenable, do:

    hbase(main):001:0> balance_switch true
    false
    0 row(s) in 0.3590 seconds

    Rolling Restart

    You can also ask this script to restart a RegionServer after the shutdown AND move its old regions back into place. The latter you might do to retain data locality. A primitive rolling restart might be effected by running something like the following:

    $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
                

    Tail the output of /tmp/log.txt to follow the script's progress. The above does RegionServers only. Be sure to disable the load balancer before doing the above. You'd need to do the master update separately; do it before you run the above script. Here is a pseudo-script for how you might craft a rolling restart script:

    1. Untar your release, make sure of its configuration and then rsync it across the cluster. If this is 0.90.2, patch it with HBASE-3744 and HBASE-3756.

    2. Run hbck to ensure the cluster is consistent

      $ ./bin/hbase hbck

      Effect repairs if inconsistent.

    3. Restart the Master:

      $ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master

    4. Disable the region balancer:

      $ echo "balance_switch false" | ./bin/hbase

    5. Run the graceful_stop.sh script per RegionServer. For example:

      $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
                  

      If you are running thrift or rest servers on the RegionServer, pass --thrift or --rest options (See usage for graceful_stop.sh script).

    6. Restart the Master again. This will clear out the dead servers list and re-enable the balancer.

    7. Run hbck to ensure the cluster is consistent.

    CopyTable

    CopyTable is a utility that can copy part or all of a table, either to the same cluster or to another cluster. The usage is as follows:

    $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--rs.class=CLASS] [--rs.impl=IMPL] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] tablename
    

    Options:

    • rs.class hbase.regionserver.class of the peer cluster. Specify if different from current cluster.
    • rs.impl hbase.regionserver.impl of the peer cluster.
    • starttime Beginning of the time range. Without endtime means starttime to forever.
    • endtime End of the time range.
    • new.name New table's name.
    • peer.adr Address of the peer cluster given in the format hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
    • families Comma-separated list of ColumnFamilies to copy.

    Args:

    • tablename Name of table to copy.

    Example of copying 'TestTable' to a cluster that uses replication for a 1 hour window:

    $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
    --rs.class=org.apache.hadoop.hbase.ipc.ReplicationRegionInterface \
    --rs.impl=org.apache.hadoop.hbase.regionserver.replication.ReplicationRegionServer \
    --starttime=1265875194289 --endtime=1265878794289 \
    --peer.adr=server1,server2,server3:2181:/hbase TestTable

    Appendix B. Compression In HBase

    CompressionTest Tool

    HBase includes a tool to test that compression is set up properly. To run it, type ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest. This will emit usage on how to run the tool.

    hbase.regionserver.codecs

    To have a RegionServer test a set of codecs and fail to start if any codec is missing or misinstalled, add the configuration hbase.regionserver.codecs to your hbase-site.xml with a value of codecs to test on startup. For example, if the hbase.regionserver.codecs value is lzo,gz and lzo is not present or improperly installed, the misconfigured RegionServer will fail to start.

    Administrators might make use of this facility to guard against the case where a new server is added to the cluster but the cluster requires installation of a particular codec.

    LZO

    Unfortunately, HBase cannot ship with LZO because of the licensing issues; HBase is Apache-licensed, LZO is GPL. Therefore LZO install is to be done post-HBase install. See the Using LZO Compression wiki page for how to make LZO work with HBase.

    A common problem users run into when using LZO is that while initial setup of the cluster runs smoothly, a month goes by and some sysadmin goes to add a machine to the cluster, only they'll have forgotten to do the LZO fixup on the new machine. In versions since HBase 0.90.0, we should fail in a way that makes it plain what the problem is, but maybe not.

    See the section called “ hbase.regionserver.codecs ” for a feature to help protect against failed LZO install.

    GZIP

    GZIP will generally compress better than LZO, though it is slower. For some setups, better compression may be preferred. Java will use Java's GZIP unless the native Hadoop libs are available on the CLASSPATH, in which case it will use native compressors instead. (If the native libs are NOT present, you will see lots of Got brand-new compressor reports in your logs; see the FAQ entry on these messages.)

    SNAPPY

    If snappy is installed, HBase can make use of it (courtesy of hadoop-snappy [11]).

    1. Build and install snappy on all nodes of your cluster.

    2. Use CompressionTest to verify snappy support is enabled and the libs can be loaded ON ALL NODES of your cluster:

      $ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy

    3. Create a column family with snappy compression and verify it in the hbase shell:

      hbase> create 't1', { NAME => 'cf1', COMPRESSION => 'SNAPPY' }
      hbase> describe 't1'

      In the output of the "describe" command, you need to ensure it lists "COMPRESSION => 'SNAPPY'"



    [11] See Alejandro's note up on the list on difference between Snappy in Hadoop and Snappy in HBase

    Appendix C. FAQ

    C.1. General
    Are there other HBase FAQs?
    Does HBase support SQL?
    How does HBase work on top of HDFS?
    Can I change a table's rowkeys?
    Why are logs flooded with '2011-01-10 12:40:48,407 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor' messages?
    C.2. EC2
    Why doesn't my remote java connection into my ec2 cluster work?
    C.3. Building HBase
    When I build, why do I always get Unable to find resource 'VM_global_library.vm'?
    C.4. Runtime
    I'm having problems with my HBase cluster, how can I troubleshoot it?
    How can I improve HBase cluster performance?
    C.5. How do I...?
    Secondary Indexes in HBase?
    Store (fill in the blank) in HBase?
    Back up my HBase Cluster?

    C.1. General

    Are there other HBase FAQs?
    Does HBase support SQL?
    How does HBase work on top of HDFS?
    Can I change a table's rowkeys?
    Why are logs flooded with '2011-01-10 12:40:48,407 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor' messages?

    Are there other HBase FAQs?

    See the FAQ that is up on the wiki, HBase Wiki FAQ.

    Does HBase support SQL?

    Not really. SQL-ish support for HBase via Hive is in development; however, Hive is based on MapReduce, which is not generally suitable for low-latency requests. See the Chapter 5, Data Model section for examples on the HBase client.

    How does HBase work on top of HDFS?

    HDFS is a distributed file system that is well suited for the storage of large files. Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. See the Chapter 5, Data Model and Chapter 6, Architecture sections for more information on how HBase achieves its goals.

    Can I change a table's rowkeys?

    No. See the section called “ Immutability of Rowkeys ”.

    Why are logs flooded with '2011-01-10 12:40:48,407 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor' messages?

    Because we are not using the native versions of the compression libraries. See HBASE-1900 Put back native support when hadoop 0.21 is released. Copy the native libs from Hadoop under the HBase lib dir, or symlink them into place, and the message should go away.

    C.2. EC2

    Why doesn't my remote java connection into my ec2 cluster work?

    Why doesn't my remote java connection into my ec2 cluster work?

    See Andrew's answer here, up on the user list: Remote Java client connection into EC2 instance.

    C.3. Building HBase

    When I build, why do I always get Unable to find resource 'VM_global_library.vm'?

    When I build, why do I always get Unable to find resource 'VM_global_library.vm'?

    Ignore it. It's not an error. It is officially ugly though.

    C.4. Runtime

    I'm having problems with my HBase cluster, how can I troubleshoot it?
    How can I improve HBase cluster performance?

    I'm having problems with my HBase cluster, how can I troubleshoot it?

    See ???.

    How can I improve HBase cluster performance?

    See ???.

    C.5. How do I...?

    Secondary Indexes in HBase?
    Store (fill in the blank) in HBase?
    Back up my HBase Cluster?

    Secondary Indexes in HBase?

    See the section called “ Secondary Indexes and Alternate Query Paths ”

    Store (fill in the blank) in HBase?

    See the section called “ Supported Datatypes ”.

    Back up my HBase Cluster?

    See HBase Backup Options over on the Sematext Blog.

    Appendix D. YCSB: The Yahoo! Cloud Serving Benchmark and HBase

    TODO: Describe how YCSB is poor for putting up a decent cluster load.

    TODO: Describe setup of YCSB for HBase

    Ted Dunning redid YCSB so that it is mavenized, and added a facility for verifying workloads. See Ted Dunning's YCSB.

    Appendix E. HFile format version 2

    Mikhail Bautin

    Liyin Tang

    Kannan Muthukarrupan

    Motivation

    We found it necessary to revise the HFile format after encountering high memory usage and slow startup times caused by large Bloom filters and block indexes in the region server. Bloom filters can get as large as 100 MB per HFile, which adds up to 2 GB when aggregated over 20 regions. Block indexes can grow as large as 6 GB in aggregate size over the same set of regions. A region is not considered opened until all of its block index data is loaded. Large Bloom filters produce a different performance problem: the first get request that requires a Bloom filter lookup will incur the latency of loading the entire Bloom filter bit array.

    To speed up region server startup we break Bloom filters and block indexes into multiple blocks and write those blocks out as they fill up, which also reduces the HFile writer’s memory footprint. In the Bloom filter case, “filling up a block” means accumulating enough keys to efficiently utilize a fixed-size bit array, and in the block index case we accumulate an “index block” of the desired size. Bloom filter blocks and index blocks (we call these “inline blocks”) become interspersed with data blocks, and as a side effect we can no longer rely on the difference between block offsets to determine data block length, as it was done in version 1.

    HFile is a low-level file format by design, and it should not deal with application-specific details such as Bloom filters, which are handled at StoreFile level. Therefore, we call Bloom filter blocks in an HFile "inline" blocks. We also supply HFile with an interface to write those inline blocks.

    Another format modification aimed at reducing the region server startup time is to use a contiguous “load-on-open” section that has to be loaded in memory at the time an HFile is being opened. Currently, as an HFile opens, there are separate seek operations to read the trailer, data/meta indexes, and file info. To read the Bloom filter, there are two more seek operations for its “data” and “meta” portions. In version 2, we seek once to read the trailer and seek again to read everything else we need to open the file from a contiguous block.

    HFile format version 1 overview

    As we will be discussing the changes we are making to the HFile format, it is useful to give a short overview of the previous (HFile version 1) format. An HFile in the existing format is structured as shown in the figure “HFile Version 1” [12].

    Block index format in version 1

    The block index in version 1 is very straightforward. For each entry, it contains:

    1. Offset (long)

    2. Uncompressed size (int)

    3. Key (a serialized byte array written using Bytes.writeByteArray)

      1. Key length as a variable-length integer (VInt)

      2. Key bytes

    The number of entries in the block index is stored in the fixed file trailer, and has to be passed in to the method that reads the block index. One of the limitations of the block index in version 1 is that it does not provide the compressed size of a block, which turns out to be necessary for decompression. Therefore, the HFile reader has to infer this compressed size from the offset difference between blocks. We fix this limitation in version 2, where we store on-disk block size instead of uncompressed size, and get uncompressed size from the block header.
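    To make the layout concrete, here is a rough Java sketch (not HBase's actual reader code) of parsing version 1 block index entries as described above; the entry count is assumed to come from the fixed file trailer, and the class and field names are ours.

    import java.io.DataInput;
    import java.io.IOException;
    import org.apache.hadoop.hbase.util.Bytes;

    class V1BlockIndexEntry {
      long offset;          // 1. block offset in the file (long)
      int uncompressedSize; // 2. uncompressed block size (int); on-disk size must be inferred
      byte[] key;           // 3. first key of the block (VInt length + key bytes)
    }

    class V1BlockIndexReader {
      static V1BlockIndexEntry[] read(DataInput in, int entryCount) throws IOException {
        V1BlockIndexEntry[] entries = new V1BlockIndexEntry[entryCount];
        for (int i = 0; i < entryCount; i++) {
          V1BlockIndexEntry e = new V1BlockIndexEntry();
          e.offset = in.readLong();
          e.uncompressedSize = in.readInt();
          e.key = Bytes.readByteArray(in); // reads the VInt length, then the key bytes
          entries[i] = e;
        }
        return entries;
      }
    }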

    HBase file format with inline blocks (version 2)

    Overview

    The version of HBase introducing the above features reads both version 1 and 2 HFiles, but only writes version 2 HFiles. A version 2 HFile is structured as shown in the figure “HFile Version 2”.

    Unified version 2 block format

    In version 2, every block in the data section contains the following fields:

    1. 8 bytes: Block type, a sequence of bytes equivalent to version 1's "magic records". Supported block types are:

      1. DATA – data blocks

      2. LEAF_INDEX – leaf-level index blocks in a multi-level block index

      3. BLOOM_CHUNK – Bloom filter chunks

      4. META – meta blocks (not used for Bloom filters in version 2 anymore)

      5. INTERMEDIATE_INDEX – intermediate-level index blocks in a multi-level block index

      6. ROOT_INDEX – root-level index blocks in a multi-level block index

      7. FILE_INFO – the “file info” block, a small key-value map of metadata

      8. BLOOM_META – a Bloom filter metadata block in the load-on-open section

      9. TRAILER – a fixed-size file trailer. As opposed to the above, this is not an HFile v2 block but a fixed-size (for each HFile version) data structure

      10. INDEX_V1 – this block type is only used for legacy HFile v1 blocks

    2. Compressed size of the block's data, not including the header (int).

      Can be used for skipping the current data block when scanning HFile data.

    3. Uncompressed size of the block's data, not including the header (int)

      This is equal to the compressed size if the compression algorithm is NONE

    4. File offset of the previous block of the same type (long)

      Can be used for seeking to the previous data/index block

    5. Compressed data (or uncompressed data if the compression algorithm is NONE).

    The above format of blocks is used in the following HFile sections:

    1. Scanned block section. The section is named so because it contains all data blocks that need to be read when an HFile is scanned sequentially.  Also contains leaf block index and Bloom chunk blocks.

    2. Non-scanned block section. This section still contains unified-format v2 blocks but it does not have to be read when doing a sequential scan. This section contains “meta” blocks and intermediate-level index blocks.

    We are supporting “meta” blocks in version 2 the same way they were supported in version 1, even though we do not store Bloom filter data in these blocks anymore.
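    As a recap of the unified block format described above, here is an illustrative Java sketch (not the HFile implementation; class and field names are ours) of the per-block header fields:

    import java.io.DataInput;
    import java.io.IOException;

    class HFileV2BlockHeader {
      static final int MAGIC_LENGTH = 8;

      byte[] blockType = new byte[MAGIC_LENGTH]; // 8-byte block type magic (DATA, LEAF_INDEX, ...)
      int onDiskSizeWithoutHeader;               // compressed size of the block's data, header excluded
      int uncompressedSizeWithoutHeader;         // equals the on-disk size when compression is NONE
      long prevBlockOffset;                      // file offset of the previous block of the same type

      static HFileV2BlockHeader read(DataInput in) throws IOException {
        HFileV2BlockHeader h = new HFileV2BlockHeader();
        in.readFully(h.blockType);
        h.onDiskSizeWithoutHeader = in.readInt();
        h.uncompressedSizeWithoutHeader = in.readInt();
        h.prevBlockOffset = in.readLong();
        // The compressed (or uncompressed, if compression is NONE) data follows the header.
        return h;
      }
    }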

    Block index in version 2

    There are three types of block indexes in HFile version 2, stored in two different formats (root and non-root):

    1. Data index — version 2 multi-level block index, consisting of:

      1. Version 2 root index, stored in the data block index section of the file

      2. Optionally, version 2 intermediate levels, stored in the non-root format in the data index section of the file. Intermediate levels can only be present if leaf level blocks are present

      3. Optionally, version 2 leaf levels, stored in the non-root format inline with data blocks

    2. Meta index — version 2 root index format only, stored in the meta index section of the file

    3. Bloom index — version 2 root index format only, stored in the “load-on-open” section as part of Bloom filter metadata.

    Root block index format in version 2

    This format applies to:

    1. Root level of the version 2 data index

    2. Entire meta and Bloom indexes in version 2, which are always single-level.

    A version 2 root index block is a sequence of entries of the following format, similar to entries of a version 1 block index, but storing on-disk size instead of uncompressed size.

    1. Offset (long)

      This offset may point to a data block or to a deeper-level index block.

    2. On-disk size (int)

    3. Key (a serialized byte array stored using Bytes.writeByteArray)

      1. Key length as a variable-length integer (VInt)

      2. Key bytes

    A single-level version 2 block index consists of just a single root index block. To read a root index block of version 2, one needs to know the number of entries. For the data index and the meta index the number of entries is stored in the trailer, and for the Bloom index it is stored in the compound Bloom filter metadata.

    For a multi-level block index we also store the following fields in the root index block in the load-on-open section of the HFile, in addition to the data structure described above:

    1. Middle leaf index block offset

    2. Middle leaf block on-disk size (meaning the leaf index block containing the reference to the “middle” data block of the file)

    3. The index of the mid-key (defined below) in the middle leaf-level block.

    These additional fields are used to efficiently retrieve the mid-key of the HFile used in HFile splits, which we define as the first key of the block with a zero-based index of (n – 1) / 2, if the total number of blocks in the HFile is n. This definition is consistent with how the mid-key was determined in HFile version 1, and is reasonable in general, because blocks are likely to be the same size on average, but we don’t have any estimates on individual key/value pair sizes.
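    For example, if an HFile has n = 7 data blocks, the mid-key is the first key of the block with zero-based index (7 – 1) / 2 = 3, i.e. the fourth block.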

    When writing a version 2 HFile, the total number of data blocks pointed to by every leaf-level index block is kept track of. When we finish writing and the total number of leaf-level blocks is determined, it is clear which leaf-level block contains the mid-key, and the fields listed above are computed.  When reading the HFile and the mid-key is requested, we retrieve the middle leaf index block (potentially from the block cache) and get the mid-key value from the appropriate position inside that leaf block.

    Non-root block index format in version 2

    This format applies to intermediate-level and leaf index blocks of a version 2 multi-level data block index. Every non-root index block is structured as follows.

    1. numEntries: the number of entries (int).

    2. entryOffsets: the “secondary index” of offsets of entries in the block, to facilitate a quick binary search on the key (numEntries + 1 int values). The last value is the total length of all entries in this index block. For example, in a non-root index block with entry sizes 60, 80, 50 the “secondary index” will contain the following int array: {0, 60, 140, 190}.

    3. Entries. Each entry contains:

      1. Offset of the block referenced by this entry in the file (long)

      2. On-disk size of the referenced block (int)

      3. Key. The length can be calculated from entryOffsets.
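    The following Java sketch (illustrative only, not HBase's code) shows how the entryOffsets “secondary index” described above supports a binary search over variable-length entries; here block is assumed to hold just the entries portion of a non-root index block, with entryOffsets relative to its start.

    import org.apache.hadoop.hbase.util.Bytes;

    class NonRootIndexSearch {
      // Returns the index of the last entry whose key is <= searchKey, or -1 if none.
      static int locate(byte[] block, int[] entryOffsets, byte[] searchKey) {
        int lo = 0, hi = entryOffsets.length - 2; // entryOffsets has numEntries + 1 values
        int result = -1;
        while (lo <= hi) {
          int mid = (lo + hi) >>> 1;
          // Each entry: 8-byte block offset, 4-byte on-disk size, then the key.
          int keyStart = entryOffsets[mid] + 8 + 4;
          int keyLen = entryOffsets[mid + 1] - keyStart;
          int cmp = Bytes.compareTo(block, keyStart, keyLen,
                                    searchKey, 0, searchKey.length);
          if (cmp <= 0) {
            result = mid;   // candidate block; keep looking for a later one
            lo = mid + 1;
          } else {
            hi = mid - 1;
          }
        }
        return result;
      }
    }

    With the {0, 60, 140, 190} example above, the key of entry 1 starts at offset 60 + 12 = 72 within the entries region and has length 140 - 72 = 68.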

    Bloom filters in version 2

    In contrast with version 1, in a version 2 HFile Bloom filter metadata is stored in the load-on-open section of the HFile for quick startup.

    1. A compound Bloom filter.

      1. Bloom filter version = 3 (int). There used to be a DynamicByteBloomFilter class that had the Bloom filter version number 2.

      2. The total byte size of all compound Bloom filter chunks (long)

      3. Number of hash functions (int)

      4. Type of hash functions (int)

      5. The total key count inserted into the Bloom filter (long)

      6. The maximum total number of keys in the Bloom filter (long)

      7. The number of chunks (int)

      8. Comparator class used for Bloom filter keys, a UTF-8 encoded string stored using Bytes.writeByteArray

      9. Bloom block index in the version 2 root block index format
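    As an illustration of the metadata layout just listed, here is a plain-data Java sketch (class and field names are ours, not HBase's) that reads the fields in the order given, up to the Bloom block index:

    import java.io.DataInput;
    import java.io.IOException;
    import org.apache.hadoop.hbase.util.Bytes;

    class CompoundBloomMeta {
      int version;            // 1. Bloom filter version, = 3 for compound Bloom filters
      long totalByteSize;     // 2. total byte size of all compound Bloom filter chunks
      int hashCount;          // 3. number of hash functions
      int hashType;           // 4. type of hash functions
      long keyCount;          // 5. total key count inserted into the Bloom filter
      long maxKeys;           // 6. maximum total number of keys
      int numChunks;          // 7. number of chunks
      String comparatorClass; // 8. comparator class for Bloom filter keys

      static CompoundBloomMeta read(DataInput in) throws IOException {
        CompoundBloomMeta m = new CompoundBloomMeta();
        m.version = in.readInt();
        m.totalByteSize = in.readLong();
        m.hashCount = in.readInt();
        m.hashType = in.readInt();
        m.keyCount = in.readLong();
        m.maxKeys = in.readLong();
        m.numChunks = in.readInt();
        m.comparatorClass = Bytes.toString(Bytes.readByteArray(in));
        // 9. The Bloom block index (root block index format) follows.
        return m;
      }
    }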

    File Info format in versions 1 and 2

    The file info block is a serialized HbaseMapWritable (essentially a map from byte arrays to byte arrays) with the following keys, among others. StoreFile-level logic adds more keys to this.

    hfile.LASTKEY

    The last key of the file (byte array)

    hfile.AVG_KEY_LEN

    The average key length in the file (int)

    hfile.AVG_VALUE_LEN

    The average value length in the file (int)

    The file info format did not change in version 2. However, we moved the file info to the final section of the file, which can be loaded as one block when the HFile is being opened. Also, we do not store the comparator in the version 2 file info anymore. Instead, we store it in the fixed file trailer. This is because we need to know the comparator at the time of parsing the load-on-open section of the HFile.

    Fixed file trailer format differences between versions 1 and 2

    The following compares, field by field, the fixed file trailers in versions 1 and 2. Note that the size of the trailer is different depending on the version, so it is “fixed” only within one version. However, the version is always stored as the last four-byte integer in the file.

    • File info offset (long): present in both versions.

    • Version 1: Data index offset (long). Version 2: loadOnOpenOffset (long), the offset of the section that we need to load when opening the file.

    • Number of data index entries (int): present in both versions.

    • Version 1: metaIndexOffset (long); this field is not used by the version 1 reader, so it was removed from version 2. Version 2: uncompressedDataIndexSize (long), the total uncompressed size of the whole data block index, including root-level, intermediate-level, and leaf-level blocks.

    • Number of meta index entries (int): present in both versions.

    • Total uncompressed bytes (long): present in both versions.

    • Version 1: numEntries (int). Version 2: numEntries (long).

    • Compression codec: 0 = LZO, 1 = GZ, 2 = NONE (int): present in both versions.

    • Version 2 only: the number of levels in the data block index (int).

    • Version 2 only: firstDataBlockOffset (long), the offset of the first data block; used when scanning.

    • Version 2 only: lastDataBlockEnd (long), the offset of the first byte after the last key/value data block; we don't need to go beyond this offset when scanning.

    • Version 1: Version: 1 (int). Version 2: Version: 2 (int).



    [12] Image courtesy of Lars George, hbase-architecture-101-storage.html.

    Index

    C

    Cells, Cells
    Column Family, Column Family
    Column Family Qualifier, Column Family
    Compression, Compression In HBase

    V

    Versions, Versions