1.2. Not-so-quick Start Guide

1.2.1. Requirements

HBase has the following requirements. Please read the section below carefully and ensure that all requirements have been satisfied. Failure to do so will cause you (and us) grief debugging strange errors and/or data loss.

1.2.1.1. Java

Just like Hadoop, HBase requires Java 6 from Oracle. You will usually want the latest version available, with the exception of the problematic u18 (u22 is the latest version as of this writing).
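
To check which version a node will actually run, inspect the java binary on the PATH (the exact build string below is illustrative):

      $ java -version
      java version "1.6.0_22"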

1.2.1.2. Hadoop

This version of HBase will only run on Hadoop 0.20.x. HBase will lose data unless it is running on an HDFS that has a durable sync. Currently only the branch-0.20-append branch has this attribute. No official releases have been made from this branch as of this writing, so you will have to build your own Hadoop from the tip of this branch (or install Cloudera's CDH3, in beta as of this writing, which includes the 0.20-append patches needed for a durable sync). See CHANGES.txt in branch-0.20-append for the list of patches involved.
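
If you build your own, the checkout might look something like the following; the subversion URL reflects the Apache repository layout as of this writing, and ant jar is the stock Hadoop build target, so verify both against the branch before relying on them:

      $ svn checkout http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append hadoop-append
      $ cd hadoop-append
      $ ant jar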

1.2.1.3. ssh

ssh must be installed and sshd must be running to use Hadoop's scripts to manage remote Hadoop daemons. You must be able to ssh to all nodes, including your local node, using passwordless login (Google "ssh passwordless login").
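
One common recipe, assuming OpenSSH and a home directory that is shared across the cluster (otherwise, append the public key to ~/.ssh/authorized_keys on each node):

      $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
      $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
      $ chmod 600 ~/.ssh/authorized_keys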

1.2.1.4. DNS

Basic name resolution must be working correctly on your cluster: every node must be able to resolve the hostname of every other node, including its own.
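
A quick spot-check from any node; example1 and the address shown are stand-ins for your own hosts:

      $ getent hosts example1     # forward lookup through the system resolver
      $ host 10.0.0.1             # reverse lookup of example1's address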

1.2.1.5. NTP

The clocks on cluster members should be in basic alignment. Some skew is tolerable, but wild skew can generate odd behaviors. Run NTP on your cluster, or an equivalent.
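
On a node already running ntpd, a quick way to eyeball skew against the configured time servers (the offset column is in milliseconds):

      $ ntpq -p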

1.2.1.6. ulimit

HBase is a database; it uses a lot of files at the same time. The default ulimit -n of 1024 on *nix systems is insufficient. Any significant amount of loading will lead you to the FAQ: Why do I see "java.io.IOException...(Too many open files)" in my logs? You will also notice errors like the following:

      2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
      2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901

Do yourself a favor and raise the upper bound on the number of file descriptors. Set it to north of 10k. See the above-referenced FAQ for how.

To be clear, upping the file descriptors for the user who is running the HBase process is an operating system configuration, not an HBase configuration.
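
For example, on a Linux system using PAM you might add the following to /etc/security/limits.conf, assuming a hypothetical hadoop user runs the HBase process; adjust the user name and limit to taste. On Debian-flavored systems, /etc/pam.d/common-session may also need the line session required pam_limits.so before the new limit takes effect at login.

      # /etc/security/limits.conf
      # Raise the open-file limit for the (hypothetical) hadoop user.
      hadoop  -  nofile  32768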

1.2.1.7. dfs.datanode.max.xcievers

Hadoop HDFS has an upper bound on the number of files that it will serve at the same time, called xcievers (yes, this is misspelled). Again, before doing any loading, make sure you have configured Hadoop's conf/hdfs-site.xml, setting the xcievers value to at least the following:

      <property>
        <name>dfs.datanode.max.xcievers</name>
        <value>2047</value>
      </property>
      

1.2.2. HBase run modes: Standalone, Pseudo-distributed, and Distributed

HBase has three different run modes: standalone, which is what is described above in the Quick Start; pseudo-distributed, where all daemons run on a single server; and distributed, where each of the daemons runs on a different cluster node.

1.2.2.1. Standalone HBase

This is the default mode and is what the Quick Start above walks through: all daemons run inside a single JVM and persist to the local filesystem rather than to HDFS.

1.2.2.2. Pseudo-distributed

A pseudo-distributed cluster runs all daemons on a single host. Use this mode to test HBase against a real HDFS: point hbase.rootdir at a local HDFS instance and set hbase.cluster.distributed to true, as in the sketch below.
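
A minimal hbase-site.xml sketch for this mode, assuming an HDFS namenode is already running at localhost:9000:

      <configuration>
        <property>
          <name>hbase.rootdir</name>
          <value>hdfs://localhost:9000/hbase</value>
        </property>
        <property>
          <name>hbase.cluster.distributed</name>
          <value>true</value>
        </property>
      </configuration>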

1.2.2.3. Distributed

In a fully-distributed cluster the daemons are spread across the cluster nodes, backed by a running HDFS and a ZooKeeper ensemble. See the example ten node configuration in Section 1.2.4.1 below.

1.2.3. Client configuration and dependencies connecting to an HBase cluster

A client needs the HBase jars and their dependencies on its CLASSPATH, plus the cluster's hbase-site.xml (or the equivalent properties set programmatically): at a minimum the client must be able to read hbase.zookeeper.quorum so it can find the ZooKeeper ensemble and, through it, the cluster. A sketch follows.
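
A minimal Java sketch, assuming the 0.90-era client API; the table and column names are hypothetical (the table must already exist), and the quorum matches the example cluster in the next section:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.util.Bytes;

      public class ClientExample {
        public static void main(String[] args) throws Exception {
          // Reads hbase-site.xml from the CLASSPATH if one is present.
          Configuration conf = HBaseConfiguration.create();
          // Or point at the ensemble explicitly:
          conf.set("hbase.zookeeper.quorum", "example1,example2,example3");
          HTable table = new HTable(conf, "testtable");  // hypothetical, pre-existing table
          Put put = new Put(Bytes.toBytes("row1"));
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
          table.put(put);
          table.close();
        }
      }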

1.2.4. Example Configurations

In this section we provide a few sample configurations.

1.2.4.1. Basic Distributed HBase Install

Here is an example basic configuration of a ten node cluster running in distributed mode. The nodes are named example0, example1, etc., through example9. The HBase Master and the HDFS namenode are running on the node example0. RegionServers run on nodes example1-example9. A 3-node ZooKeeper ensemble runs on example1, example2, and example3. Below we show what the main configuration files -- hbase-site.xml, regionservers, and hbase-env.sh -- found in the conf directory might look like.

1.2.4.1.1. hbase-site.xml

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>example1,example2,example3</value>
    <description>Comma separated list of servers in the ZooKeeper Quorum.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/export/stack/zookeeper</value>
    <description>Property from ZooKeeper's config zoo.cfg.
    The directory where the snapshot is stored.
    </description>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://example0:9000/hbase</value>
    <description>The directory shared by region servers.
    </description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed ZooKeeper
      true: fully-distributed with unmanaged ZooKeeper Quorum (see hbase-env.sh)
    </description>
  </property>
</configuration>

    
1.2.4.1.2. regionservers

In this file you list the nodes that will run regionservers. In our case, we run regionservers on all nodes but the head node example0, which carries the HBase Master and the HDFS namenode.

    example1
    example2
    example3
    example4
    example5
    example6
    example7
    example8
    example9
    
1.2.4.1.3. hbase-env.sh

Below we use diff to show the differences from the default hbase-env.sh file. Here we are setting the HBase heap to be 4G instead of the default 1G.

    
$ git diff hbase-env.sh
diff --git a/conf/hbase-env.sh b/conf/hbase-env.sh
index e70ebc6..96f8c27 100644
--- a/conf/hbase-env.sh
+++ b/conf/hbase-env.sh
@@ -31,7 +31,7 @@ export JAVA_HOME=/usr/lib/jvm/java-6-sun/
 # export HBASE_CLASSPATH=
 
 # The maximum amount of heap to use, in MB. Default is 1000.
-# export HBASE_HEAPSIZE=1000
+export HBASE_HEAPSIZE=4096
 
 # Extra Java runtime options.
 # Below are what we set by default.  May only work with SUN JVM.