Index: src/docbkx/book.xml =================================================================== --- src/docbkx/book.xml (revision 1180918) +++ src/docbkx/book.xml (working copy) @@ -312,7 +312,7 @@ A good general introduction on the strength and weaknesses modelling on the various non-rdbms datastores is Ian Varleys' Master thesis, No Relation: The Mixed Blessings of Non-Relational Databases. - Recommended. + Recommended. Also, read for how HBase stores data internally.
@@ -400,7 +400,7 @@ </para> <para>Most of the time small inefficiencies don't matter all that much. Unfortunately, this is a case where they do. Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they could be repeated - several billion times in your data</para> + several billion times in your data. See <xref linkend="keyvalue"/> for more information on how HBase stores data internally.</para> <section xml:id="keysize.cf"><title>Column Families Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default). @@ -1615,6 +1615,8 @@ Schubert Zhang's blog post on HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs makes for a thorough introduction to HBase's hfile. Matteo Bertozzi has also put up a helpful description, HBase I/O: HFile. + For more information, see the HFile source code. +
@@ -1631,6 +1633,40 @@ tool.
+
+ Blocks + StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis. + + For more information, see the HFileBlock source code. + +
+
+ KeyValue + The KeyValue class is the heart of data storage in HBase. KeyValue wraps a byte array and takes an offset and length into the passed array + that specify where to start interpreting the content as a KeyValue. + + The KeyValue format inside a byte array is: + + keylength + valuelength + key + value + + + The Key is further decomposed as: + + rowlength + row (i.e., the rowkey) + columnfamilylength + columnfamily + columnqualifier + timestamp + keytype (e.g., Put, Delete) + + + For more information, see the KeyValue source code. + +
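The layout listed above can be sketched in a few lines of Python. This is an illustration only, not HBase code: the field widths (4-byte key and value lengths, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) and the type code used for Put are assumptions based on a reading of the KeyValue source, and `encode_keyvalue` is a hypothetical helper name.

```python
import struct

# Assumed type code for a Put; check the KeyValue source before relying on it.
PUT = 4

def encode_keyvalue(row, family, qualifier, timestamp, keytype, value):
    """Serialize one cell in the field order described above (big-endian)."""
    key = (
        struct.pack(">H", len(row)) + row            # rowlength, row
        + struct.pack(">B", len(family)) + family    # columnfamilylength, columnfamily
        + qualifier                                  # columnqualifier (no length prefix)
        + struct.pack(">q", timestamp)               # timestamp
        + struct.pack(">B", keytype)                 # keytype
    )
    # keylength, valuelength, key, value
    return struct.pack(">ii", len(key), len(value)) + key + value

kv = encode_keyvalue(b"row1", b"d", b"attr", 1318000000000, PUT, b"v")
key_len, value_len = struct.unpack(">ii", kv[:8])
print(len(kv), key_len, value_len)
```

Note that the qualifier carries no length prefix of its own in this sketch; its length is whatever remains of the key after the other fields are accounted for.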
Compaction There are two types of compactions: minor and major. Minor compactions will usually pick up a couple of the smaller adjacent Index: src/docbkx/ops_mgt.xml =================================================================== --- src/docbkx/ops_mgt.xml (revision 1180918) +++ src/docbkx/ops_mgt.xml (working copy) @@ -301,6 +301,32 @@ Since the cluster is up, there is a risk that edits could be missed in the export process.
+ +
Capacity Planning +
Storage + A common question for HBase administrators is estimating how much storage will be required for an HBase cluster. + There are several aspects to consider, the most important of which is what data is loaded into the cluster. Start + with a solid understanding of how HBase handles data internally (KeyValue). + +
KeyValue + HBase storage will be dominated by KeyValues. See and for + how HBase stores data internally. + + It is critical to understand that there is a KeyValue instance for every attribute stored in a row, and the + rowkey-length, ColumnFamily name-length and attribute lengths will drive the size of the database more than any other + factor. + +
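As a rough illustration of why these lengths dominate, the sketch below estimates the serialized size of a row as one KeyValue per attribute. The 20 bytes of fixed overhead per KeyValue are an assumption derived from the field widths in the KeyValue source (4 + 4 + 2 + 1 + 8 + 1), the function names are hypothetical, and the estimate ignores compression, block indexes, and other file-level overhead.

```python
# Assumed fixed bytes per KeyValue: keylength(4) + valuelength(4) + rowlength(2)
# + columnfamilylength(1) + timestamp(8) + keytype(1).
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1

def keyvalue_size(row_len, family_len, qualifier_len, value_len):
    """Estimated bytes for a single KeyValue."""
    return FIXED_OVERHEAD + row_len + family_len + qualifier_len + value_len

def row_size(row_len, family_len, attributes):
    """Estimated bytes for a row; attributes is a list of (qualifier_len, value_len)."""
    return sum(keyvalue_size(row_len, family_len, q, v) for q, v in attributes)

attrs = [(4, 8), (4, 8), (4, 8)]       # three attributes, 4-byte names, 8-byte values
short_cf = row_size(16, 1, attrs)      # ColumnFamily named "d"
long_cf = row_size(16, 14, attrs)      # ColumnFamily with a 14-character name
print(short_cf, long_cf)
```

With these made-up numbers the rowkey and ColumnFamily name are repeated in every KeyValue, so the longer family name alone adds 13 bytes per attribute.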
+
StoreFiles and Blocks + KeyValue instances are aggregated into blocks, and the blocksize is configurable on a per-ColumnFamily basis. + Blocks are aggregated into StoreFiles. See . + +
+
HDFS Block Replication + Because HBase runs on top of HDFS, factor HDFS block replication into storage calculations. + +
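A back-of-the-envelope sketch of the replication step follows. The row count and per-row size are made-up inputs, `raw_hdfs_bytes` is a hypothetical helper, and the replication factor of 3 is assumed to match the cluster's HDFS setting; compression and write-ahead logs are not accounted for.

```python
def raw_hdfs_bytes(logical_bytes, replication=3):
    """Disk consumed across the cluster once HDFS replicates every block."""
    return logical_bytes * replication

rows = 10**9            # hypothetical number of rows
bytes_per_row = 147     # hypothetical average serialized row size
logical = rows * bytes_per_row
raw = raw_hdfs_bytes(logical)
print(logical, raw)
```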
+