diff --git src/main/docbkx/schema_design.xml src/main/docbkx/schema_design.xml
index de05c14..86971e7 100644
--- src/main/docbkx/schema_design.xml
+++ src/main/docbkx/schema_design.xml
@@ -99,6 +99,47 @@ admin.enableTable(table);
Rowkey Design +
+ Hotspotting
+ Rows in HBase are sorted lexicographically by row key. This design optimizes for scans, allowing you to store related rows, or rows that will be read together, near each other. However, poorly designed row keys are a common source of hotspotting. Hotspotting occurs when a large amount of client traffic is directed at one node, or only a few nodes, of a cluster. This traffic may represent reads, writes, or other operations. The traffic overwhelms the single machine responsible for hosting that region, causing performance degradation and potentially leading to region unavailability. This can also have adverse effects on other regions hosted by the same region server, because that host is unable to service the requested load. It is important to design data access patterns such that the cluster is fully and evenly utilized.
+ To prevent hotspotting on writes, design your row keys so that rows that truly need to be in the same region are, but, in the bigger picture, data is written to multiple regions across the cluster rather than to one region at a time. One technique is to add a salt or hash prefix to row keys that would otherwise be sequential and do not need to be.
+ Salting in this sense has nothing to do with cryptography. It refers to adding a randomly assigned prefix to the row key so that it sorts differently than it otherwise would. Salting can be helpful if you have a few keys that come up over and over, along with other rows that don't fit those keys. Without salting, the regions holding rows with the "hot" keys would be overloaded compared to the other regions. Because the prefix is random, salting removes the original ordering, and a reader cannot tell which prefix a given row received, so salting is often a poorer choice than hashing. Using totally random row keys for data that is accessed sequentially removes the benefit of HBase's sorted row order and causes very poor performance, because each get or scan may need to query all regions.
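The read-side cost of a random salt can be seen in a minimal sketch. It uses plain Java strings in place of HBase row keys and makes no HBase API calls; the class name `SaltedKeys`, the bucket count of 4, and the `-` separator are all assumptions made for illustration, not part of HBase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: plain strings stand in for HBase row keys.
class SaltedKeys {
    // Hypothetical number of salt buckets; not an HBase setting.
    static final int BUCKETS = 4;

    // Writer side: prepend a randomly chosen bucket number to the key.
    static String salt(String rowKey) {
        int bucket = ThreadLocalRandom.current().nextInt(BUCKETS);
        return bucket + "-" + rowKey;
    }

    // Reader side: the salt was random, so every possible prefix is a candidate.
    static List<String> candidateKeys(String rowKey) {
        List<String> keys = new ArrayList<>();
        for (int bucket = 0; bucket < BUCKETS; bucket++) {
            keys.add(bucket + "-" + rowKey);
        }
        return keys;
    }

    public static void main(String[] args) {
        String stored = salt("20140101-event");
        System.out.println("stored as: " + stored);
        System.out.println("reader must try: " + candidateKeys("20140101-event"));
    }
}
```

A single logical get turns into one lookup per bucket, which is why a purely random prefix is expensive for data that is read by key or in sequence.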
+ Hashing refers to applying a deterministic one-way function to the row key, such that a particular row always gets the same arbitrary value applied. Because the value can be recomputed from the key, a reader can go directly to the row, which a random salt does not allow. Hashing spreads load across regions. When the hash is used as a short prefix that selects one of a fixed number of buckets, rows keep their original order within each bucket, so a range scan can still be served with one scan per bucket. One example where hashing is the right strategy would be if, for some reason, a large proportion of rows started with the same letter. Normally, these would all be sorted into the same region. You can apply a hash to artificially differentiate them and spread them out.
+ A third common trick for preventing hotspotting is to reverse a fixed-width or numeric row key so that the part that changes the most often (the least significant digit) is first. This effectively randomizes row keys, but sacrifices row ordering properties.
+ See https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion-on-designing-hbase-tables and the discussion in the comments of HBASE-11682 for more information about avoiding hotspotting.
+
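The hashed-prefix and reversed-key techniques described above can be sketched in the same way. Plain Java strings again stand in for row keys; the class name `DeterministicKeys`, the bucket count, the `-` separator, and the use of `String.hashCode()` as the hash function are assumptions chosen for brevity, not recommendations from the HBase project.

```java
// Illustrative only: plain strings stand in for HBase row keys.
class DeterministicKeys {
    // Hypothetical number of buckets; not an HBase setting.
    static final int BUCKETS = 4;

    // The prefix is derived from the key itself, so readers can recompute it.
    static String hashPrefixed(String rowKey) {
        int bucket = Math.floorMod(rowKey.hashCode(), BUCKETS);
        return bucket + "-" + rowKey;
    }

    // Reversing a fixed-width key puts the fastest-changing digit first.
    static String reversed(String fixedWidthKey) {
        return new StringBuilder(fixedWidthKey).reverse().toString();
    }

    public static void main(String[] args) {
        // The same input always yields the same prefixed key.
        System.out.println(hashPrefixed("user-alpha"));
        System.out.println(hashPrefixed("user-alpha"));
        // Consecutive numeric keys no longer share a common prefix.
        System.out.println(reversed("0000001234"));
        System.out.println(reversed("0000001235"));
    }
}
```

Unlike a random salt, both transformations are repeatable: a reader that knows the original key can rebuild the stored key and issue a single get.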
Monotonically Increasing Row Keys/Timeseries Data