HBase has the following requirements. Please read the section below carefully and ensure that all requirements have been satisfied. Failure to do so will cause you (and us) grief debugging strange errors and/or data loss.
Just like Hadoop, HBase requires Java 6 from Oracle. Usually you'll want to use the latest version available, except for the problematic u18 (u22 is the latest version as of this writing).
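A quick sanity check on each node before starting anything (the JAVA_HOME path shown is only an example):
$ java -version       # should report a Sun/Oracle 1.6.0 build, and not u18
$ echo $JAVA_HOME     # e.g. /usr/lib/jvm/java-6-sun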
This version of HBase will only run on Hadoop 0.20.x. HBase will lose data unless it is running on an HDFS that has a durable sync. Currently only the branch-0.20-append branch has this attribute. No official releases have been made from this branch as of this writing, so you will have to build your own Hadoop from the tip of this branch (or install Cloudera's CDH3, which as of this writing is in beta; it has the 0.20-append patches needed to add a durable sync). See CHANGES.txt in branch-0.20-append for the list of patches involved.
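If you build your own, a sketch of pulling the tip of the branch follows; the Subversion URL and ant target are assumptions based on the Apache Hadoop layout of the day, so verify them before relying on this:
$ svn checkout http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append hadoop-append
$ cd hadoop-append
$ ant jar    # then deploy the built jar across your HDFS nodes (and under HBase's lib directory)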
ssh must be installed and sshd must be running to use Hadoop's scripts to manage remote Hadoop daemons. You must be able to ssh to all nodes, including your local node, using passwordless login (Google "ssh passwordless login").
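A common OpenSSH setup, sketched here with illustrative hostnames and the stock ssh-keygen/ssh-copy-id tools:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa   # passphrase-less keypair
$ ssh-copy-id example1                       # repeat for every node, including your local node
$ ssh example1 hostname                      # should print the name with no password prompt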
The clocks on cluster members should be in basic alignment. Some skew is tolerable, but wild skew can generate odd behaviors. Run NTP on your cluster, or an equivalent.
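One way to check and correct skew, assuming the common ntp tooling is installed (the server and init script names vary by distribution):
$ ntpdate -q pool.ntp.org    # query only; reports this node's clock offset
$ /etc/init.d/ntpd start     # keep ntpd running on every cluster member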
HBase is a database; it uses a lot of files at the same time. The default ulimit -n of 1024 on *nix systems is insufficient. Any significant amount of loading will lead you to the FAQ: Why do I see "java.io.IOException...(Too many open files)" in my logs? You will also notice errors like the following:
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901
Do yourself a favor and change the upper bound on the number of file descriptors. Set it to north of 10k. See the above-referenced FAQ for how.
To be clear, upping the file descriptors for the user who is running the HBase process is an operating system configuration, not an HBase configuration.
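On Linux, the usual mechanism is PAM's limits file; a sketch follows, where the username hbase is illustrative (check ulimit -n in a fresh login shell afterwards to confirm the change took):
$ ulimit -n    # 1024 is the typical default
# add to /etc/security/limits.conf, then log in again:
hbase  -  nofile  32768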
Hadoop HDFS has an upper bound on the number of files that it will serve at any one time, called xcievers (yes, this is misspelled). Again, before doing any loading, make sure you have configured Hadoop's conf/hdfs-site.xml, setting the xcievers value to at least the following:
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2047</value>
</property>
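Datanodes read this setting at startup, so restart HDFS after editing it; with the stock Hadoop 0.20 scripts (run from the Hadoop install directory) that is:
$ bin/stop-dfs.sh
$ bin/start-dfs.sh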
HBase has three different run modes: standalone, which is what is described above in Quick Start; pseudo-distributed, where all daemons run on a single server; and distributed, where the daemons are spread across the nodes of a cluster.
In this section we provide a few sample configurations.
Here is a basic example configuration for a ten-node cluster running in distributed mode. The nodes are named example0, example1, etc., through node example9 in this example. The HBase Master and the HDFS namenode are running on the node example0. RegionServers run on nodes example1-example9. A 3-node ZooKeeper ensemble runs on example1, example2, and example3. Below we show what the main configuration files -- hbase-site.xml, regionservers, and hbase-env.sh -- found in the conf directory might look like.
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>example1,example2,example3</value>
    <description>Comma separated list of servers in the ZooKeeper Quorum.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/export/stack/zookeeper</value>
    <description>Property from ZooKeeper's config zoo.cfg.
    The directory where the snapshot is stored.
    </description>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://example0:9000/hbase</value>
    <description>The directory shared by region servers.
    </description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
    false: standalone and pseudo-distributed setups with managed Zookeeper
    true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
</configuration>
In this file you list the nodes that will run regionservers. In our case, we run regionservers on all nodes but the head node example0, which is carrying the HBase Master and the HDFS namenode.
example1
example2
example3
example4
example5
example6
example7
example8
example9
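Whatever you put in these files, remember that HBase will not copy configuration around the cluster for you; keep the conf directory in sync on every node yourself. A sketch with rsync (the install path is illustrative):
$ for i in 1 2 3 4 5 6 7 8 9; do rsync -az conf/ example$i:/path/to/hbase/conf/; done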
Below we use a diff to show the differences from the default in the hbase-env.sh file. Here we are setting the HBase heap to be 4G instead of the default 1G.
$ git diff hbase-env.sh
diff --git a/conf/hbase-env.sh b/conf/hbase-env.sh
index e70ebc6..96f8c27 100644
--- a/conf/hbase-env.sh
+++ b/conf/hbase-env.sh
@@ -31,7 +31,7 @@ export JAVA_HOME=/usr/lib/jvm/java-6-sun/
# export HBASE_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
-# export HBASE_HEAPSIZE=1000
+export HBASE_HEAPSIZE=4096
# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
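At this point the stock file sets the default Java runtime options; as one illustration (these exact flags are an assumption, not quoted from this document), enabling the CMS collector is a common choice for HBase on the Sun JVM:
# illustrative only; verify the flags against your JVM before using them
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"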