Index: src/docbkx/book.xml
===================================================================
--- src/docbkx/book.xml (revision 1414399)
+++ src/docbkx/book.xml (working copy)
@@ -739,6 +739,43 @@
inserted a lot of data).
+ Relationship Between RowKeys and Region Splits
+ If you pre-split your table, it is critical to understand how your rowkey will be distributed across
+ the region boundaries. To see why this is important, consider using displayable hex characters as the
+ lead position of the key (e.g., "0000000000000000" to "ffffffffffffffff"). Running those key ranges through Bytes.split
+ (which is the split strategy used when creating regions in HBaseAdmin.createTable(byte[] startKey, byte[] endKey, numRegions))
+ for 10 regions will generate the following splits...
+
+
+
+48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 // 0
+54 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 // 6
+61 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -68 // =
+68 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -126 // D
+75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 72 // K
+82 18 18 18 18 18 18 18 18 18 18 18 18 18 18 14 // R
+88 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -44 // X
+95 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -102 // _
+102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 // f
+
+ ... (note: the lead byte is listed to the right as a comment.) Given that the first split is a '0' and the last split is an 'f',
+ everything is great, right? Not so fast.
+
+ The problem is that all the data is going to pile up in the two regions that cover '0' through '9' and the one region that
+ covers 'a' through 'f', thus creating a "lumpy" (and possibly "hot") region problem. To understand why, refer to an ASCII Table.
+ '0' is byte 48, and 'f' is byte 102, but there is a huge gap in byte values (bytes 58 to 96) that will never appear in this
+ keyspace because the only values are [0-9] and [a-f]. Thus, the middle regions will
+ never be used. To make pre-splitting work with this example keyspace, a custom definition of splits (i.e., one that does not rely on the
+ built-in split method) is required.
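+
+ The pile-up can be checked with a small standalone sketch. It does not call HBase; it transcribes the nine split keys
+ from the listing above, assumes rowkeys are compared as unsigned bytes in lexicographic order, and counts where one sample
+ key per lead character lands. The class name, the helper names, and the sample key suffix are hypothetical.
+
```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HexKeyRegionDemo {
    // Build a 16-byte split key: a lead byte, 14 identical middle bytes, and a final byte.
    static byte[] row(int lead, int mid, int last) {
        byte[] key = new byte[16];
        Arrays.fill(key, (byte) mid);
        key[0] = (byte) lead;
        key[15] = (byte) last;
        return key;
    }

    // The nine split keys, transcribed from the listing above.
    static final byte[][] SPLITS = {
        row(48, 48, 48), row(54, -10, -10), row(61, -67, -68),
        row(68, -124, -126), row(75, 75, 72), row(82, 18, 14),
        row(88, -40, -44), row(95, -97, -102), row(102, 102, 102)
    };

    // Unsigned lexicographic comparison (assumed ordering of rowkeys).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Region index 0..9: the number of split keys that are <= the rowkey.
    static int regionOf(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.US_ASCII);
        int region = 0;
        for (byte[] split : SPLITS) {
            if (compareUnsigned(key, split) >= 0) {
                region++;
            }
        }
        return region;
    }

    // Count one sample key per possible lead hex character.
    static int[] countByRegion() {
        int[] counts = new int[SPLITS.length + 1];
        for (char c : "0123456789abcdef".toCharArray()) {
            String key = c + "123456789abcdef"; // lead character + 15 more hex characters
            counts[regionOf(key)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(countByRegion()));
    }
}
```
+
+ With these sample keys, all 16 lead characters fall into three of the ten regions; the other seven receive nothing.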
+
+ Lesson #1: Pre-splitting tables is generally a best practice, but you need to pre-split them in such a way that all the
+ regions are accessible in the keyspace. While this example demonstrated the problem with a hex-key keyspace, the same problem can happen
+ with any keyspace. Know your data.
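+
+ As one possible way to build such splits for the hex example, the sketch below divides the 16-digit hex keyspace numerically
+ and formats each boundary as a hex string, so every split key is itself a valid key in the keyspace. The class and method
+ names are hypothetical, and the sketch assumes fixed-width, lowercase, evenly distributed hex keys. The resulting strings
+ would still need to be converted to byte[] and supplied as explicit split keys when the table is created.
+
```java
import java.math.BigInteger;

public class HexSplits {
    // Hypothetical helper: returns numRegions - 1 evenly spaced, 16-character lowercase hex split keys.
    static String[] hexSplits(int numRegions) {
        BigInteger keyspace = BigInteger.ONE.shiftLeft(64); // 16 hex digits cover 2^64 values
        String[] splits = new String[numRegions - 1];
        for (int i = 1; i < numRegions; i++) {
            BigInteger boundary = keyspace
                .multiply(BigInteger.valueOf(i))
                .divide(BigInteger.valueOf(numRegions));
            splits[i - 1] = String.format("%016x", boundary); // zero-padded to 16 hex digits
        }
        return splits;
    }

    public static void main(String[] args) {
        for (String split : hexSplits(10)) {
            System.out.println(split);
        }
    }
}
```
+
+ For 10 regions this produces nine boundaries, from "1999999999999999" to "e666666666666666", each made up only of characters that
+ actually occur in the keyspace.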
+
+ Lesson #2: While generally not advisable, using hex-keys (and more generally, displayable data) can still work with pre-split
+ tables as long as all the created regions are accessible in the keyspace.
+
+