Index: src/docbkx/book.xml =================================================================== --- src/docbkx/book.xml (revision 1414399) +++ src/docbkx/book.xml (working copy) @@ -739,6 +739,43 @@ inserted a lot of data). +
Relationship Between RowKeys and Region Splits + If you pre-split your table, it is critical to understand how your rowkey will be distributed across + the region boundaries. As an example of why this is important, consider the example of using displayable hex characters as the + lead position of the key (e.g., ""0000000000000000" to "ffffffffffffffff"). Running those key ranges through Bytes.split + (which is the split strategy used when creating regions in HBaseAdmin.createTable(byte[] startKey, byte[] endKey, numRegions) + for 10 regions will generate the following splits... + + + +48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 // 0 +54 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 // 6 +61 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -68 // = +68 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -126 // D +75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 72 // K +82 18 18 18 18 18 18 18 18 18 18 18 18 18 18 14 // R +88 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -44 // X +95 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -102 // _ +102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 // f + + ... (note: the lead byte is listed to the right as a comment.) Given that the first split is a '0' and the last split is an 'f', + everything is great, right? Not so fast. + + The problem is that all the data is going to pile up in the first 2 regions and the last region thus creating a "lumpy" (and + possibly "hot") region problem. To understand why, refer to an ASCII Table. + '0' is byte 48, and 'f' is byte 102, but there is a huge gap in byte values (bytes 58 to 96) that will never appear in this + keyspace because the only values are [0-9] and [a-f]. Thus, the middle regions regions will + never be used. To make pre-spliting work with this example keyspace, a custom definition of splits (i.e., and not relying on the + built-in split method) is required. + + Lesson #1: Pre-splitting tables is generally a best practice, but you need to pre-split them in such a way that all the + regions are accessible in the keyspace. While this example demonstrated the problem with a hex-key keyspace, the same problem can happen + with any keyspace. Know your data. + + Lesson #2: While generally not advisable, using hex-keys (and more generally, displayable data) can still work with pre-split + tables as long as all the created regions are accessible in the keyspace. + +