Index: src/docbkx/book.xml
===================================================================
--- src/docbkx/book.xml (revision 1180918)
+++ src/docbkx/book.xml (working copy)
@@ -312,7 +312,7 @@
A good general introduction on the strength and weaknesses modelling on
the various non-rdbms datastores is Ian Varleys' Master thesis,
No Relation: The Mixed Blessings of Non-Relational Databases.
- Recommended.
+ Recommended. Also, read for how HBase stores data internally.
@@ -400,7 +400,7 @@
Most of the time small inefficiencies don't matter all that much. Unfortunately,
this is a case where they do. Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they could be repeated
- several billion times in your data
+ several billion times in your data. See for more information on how HBase stores data internally.
Column FamiliesTry to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).
@@ -1615,6 +1615,8 @@
Schubert Zhang's blog post on HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs makes for a thorough introduction to HBase's hfile. Matteo Bertozzi has also put up a
helpful description, HBase I/O: HFile.
+ For more information, see the HFile source code.
+
@@ -1631,6 +1633,40 @@
tool.
+
+ Blocks
+ StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis.
+
+ For more information, see the HFileBlock source code.
+
+
+
+ KeyValue
+ The KeyValue class is the heart of data storage in HBase. KeyValue wraps a byte array and takes an offset and a length into the passed array
+ that mark where to start interpreting the content as a KeyValue.
+
+ The KeyValue format inside a byte array is:
+
+ keylength
+ valuelength
+ key
+ value
+
+
+ The Key is further decomposed as:
+
+ rowlength
+ row (i.e., the rowkey)
+ columnfamilylength
+ columnfamily
+ columnqualifier
+ timestamp
+ keytype (e.g., Put, Delete)
+
+
+ For more information, see the KeyValue source code.
+
+ CompactionThere are two types of compactions: minor and major. Minor compactions will usually pick up a couple of the smaller adjacent
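Note on the KeyValue hunk above: the layout it describes can be sketched in a few lines of Python. This is only an illustration of the field order listed in the patch, not the HBase implementation. The field widths (4-byte keylength and valuelength, 2-byte rowlength, 1-byte columnfamilylength, 8-byte timestamp, 1-byte keytype), the big-endian encoding, and the PUT constant are assumptions that should be checked against the KeyValue source; encode_keyvalue is a hypothetical helper name.

```python
import struct

# Assumed type code for a Put; other key types (e.g., Delete) use other codes.
PUT = 4

def encode_keyvalue(row, family, qualifier, timestamp, keytype, value):
    """Serialize one cell following the field order listed in the patch."""
    key = b"".join([
        struct.pack(">H", len(row)),     # rowlength (assumed 2 bytes)
        row,                             # row (i.e., the rowkey)
        struct.pack(">B", len(family)),  # columnfamilylength (assumed 1 byte)
        family,                          # columnfamily
        qualifier,                       # columnqualifier (no length prefix listed)
        struct.pack(">q", timestamp),    # timestamp (assumed 8 bytes)
        struct.pack(">B", keytype),      # keytype (assumed 1 byte)
    ])
    return b"".join([
        struct.pack(">i", len(key)),     # keylength (assumed 4 bytes)
        struct.pack(">i", len(value)),   # valuelength (assumed 4 bytes)
        key,
        value,
    ])

kv = encode_keyvalue(b"row1", b"d", b"q", 1, PUT, b"value")
print(len(kv))
```

Under these assumptions the 5-byte value above occupies 31 bytes once serialized, which is why short rowkeys, ColumnFamily names, and qualifiers matter so much.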
Index: src/docbkx/ops_mgt.xml
===================================================================
--- src/docbkx/ops_mgt.xml (revision 1180918)
+++ src/docbkx/ops_mgt.xml (working copy)
@@ -301,6 +301,32 @@
Since the cluster is up, there is a risk that edits could be missed in the export process.
+
+ Capacity Planning
+ Storage
+ A common task for HBase administrators is estimating how much storage will be required for an HBase cluster.
+ There are several aspects to consider, the most important of which is what data will be loaded into the cluster. Start
+ with a solid understanding of how HBase handles data internally (KeyValue).
+
+ KeyValue
+ HBase storage will be dominated by KeyValues. See and for
+ how HBase stores data internally.
+
+ It is critical to understand that there is a KeyValue instance for every attribute stored in a row, and the
+ rowkey-length, ColumnFamily name-length and attribute lengths will drive the size of the database more than any other
+ factor.
+
+
+ StoreFiles and Blocks
+ KeyValue instances are aggregated into blocks, and the blocksize is configurable on a per-ColumnFamily basis.
+ Blocks are aggregated into StoreFiles. See .
+
+
+ HDFS Block Replication
+ Because HBase runs on top of HDFS, factor HDFS block replication into storage calculations.
+
+
+
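Note on the Capacity Planning hunk above: the three factors it names (KeyValue size, one KeyValue per attribute, HDFS block replication) combine into a rough back-of-the-envelope estimate. The sketch below is a hypothetical helper, not an HBase tool. It assumes 20 bytes of fixed framing per KeyValue and a replication factor of 3, and it ignores compression, block indexes, write-ahead logs, and files awaiting compaction, so treat the result as an order-of-magnitude figure only.

```python
def estimate_raw_storage_bytes(rows, cells_per_row, row_len, family_len,
                               qualifier_len, value_len, replication=3):
    """Rough size of the stored KeyValues after HDFS replication."""
    # Assumed fixed overhead per KeyValue: 4 (keylength) + 4 (valuelength)
    # + 2 (rowlength) + 1 (columnfamilylength) + 8 (timestamp) + 1 (keytype).
    fixed_overhead = 20
    # The rowkey, family name, and qualifier are repeated in every KeyValue.
    per_cell = fixed_overhead + row_len + family_len + qualifier_len + value_len
    return rows * cells_per_row * per_cell * replication

total = estimate_raw_storage_bytes(
    rows=1_000_000, cells_per_row=10,
    row_len=16, family_len=1, qualifier_len=4, value_len=8,
)
print(total)
```

With these example numbers each 8-byte value costs 49 bytes per replica, so the key and framing, not the values, account for most of the estimate.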