+ <para>Salting in this sense has nothing to do with cryptography, but refers to adding random
+ data to the start of a row key. In this case, salting refers to adding a prefix to the row
+ key to cause it to sort differently than it otherwise would. Salting can be helpful if you
+ have a few keys that come up over and over, along with other rows that don't fit those keys.
+ In that case, the regions holding rows with the "hot" keys would be overloaded, compared to
+ the other regions. Salting completely removes ordering, so is often a poorer choice than
+ hashing. Using totally random row keys for data which is accessed sequentially would remove
+ the benefit of HBase's row-sorting algorithm and cause very poor performance, as each get or
+ scan would need to query all regions.</para>
I don't think this salting example is correct about the ramifications. Both Nick and I agree that salting is putting some random value in front of the actual value. This means that instead of one sorted list of entries, we'd have n sorted lists of entries if the cardinality of the salt is n.
Example: naively we have rowkeys like this:
If we use a 4-way salt (a, b, c, d), we could end up with data re-sorted like this:
Let's say we add some new values to row foo0003. It could get salted with a new salt, let's say 'c'.
To read, we could still get things back in the original order, but we'd need a reader starting from each salt in parallel to return the rows in order (and would likely need to do some coalescing of foo0003 to combine the a-foo0003 and c-foo0003 rows back into one). The effect in this situation is that we could now be writing with 4x the throughput, since we would be writing to 4 different machines (assuming that a, b, c, and d are balanced onto different machines).
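To make the read path concrete, here is a rough sketch of what I mean, in plain Python rather than against the HBase client API. Everything in it is an assumption for illustration: the `SALTS` list, the helper names, and the use of a sorted Python list as a stand-in for a table. It only shows the idea of one scan per salt, merged by the unsalted key, with duplicate rows coalesced.

```python
import heapq
import random

SALTS = ["a", "b", "c", "d"]  # hypothetical 4-way salt


def salt_key(row_key, rng):
    """Prefix the row key with a randomly chosen salt (the write path)."""
    return f"{rng.choice(SALTS)}-{row_key}"


def unsalt(salted_key):
    """Strip the salt prefix to recover the original row key."""
    return salted_key.split("-", 1)[1]


def scan_in_original_order(table):
    """Return the original row keys in sorted order.

    `table` is a sorted list of salted keys, standing in for a real table.
    One "scan" runs per salt prefix; the per-salt results are merged by
    original key, and rows written under several salts are coalesced.
    """
    per_salt = [[k for k in table if k.startswith(s + "-")] for s in SALTS]
    merged = heapq.merge(*per_salt, key=unsalt)
    result = []
    for salted in merged:
        original = unsalt(salted)
        if not result or result[-1] != original:
            result.append(original)
    return result
```

In a real client the per-salt scans would run in parallel and the coalescing step would have to combine cell data, not just drop duplicate keys.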
Nick's point of view (please correct me if I am wrong) is that you could "salt" the original row key with a one-way hash so that foo0003 would always get salted with 'a'. This would spread row keys that are lexicographically close (foo0001 and foo0002) across different machines, which could help reduce contention and increase overall throughput, but would never allow a single row to have 4x the throughput like the other approach.
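A minimal sketch of this deterministic variant, again in plain Python and not tied to any HBase API. The salt alphabet, the function name, and the choice of SHA-256 are my assumptions; any stable hash would do. Which salt a given key actually lands on depends on the hash, so foo0003 getting 'a' is only a figure of speech.

```python
import hashlib

SALTS = "abcd"  # hypothetical 4-way salt, as in the example above


def hashed_salt_key(row_key):
    """Prefix the row key with a salt derived from a hash of the key itself.

    The same row key always maps to the same salt, so a point lookup can
    recompute the full key without trying every prefix.
    """
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    salt = SALTS[digest[0] % len(SALTS)]
    return f"{salt}-{row_key}"
```

Because the prefix is a pure function of the key, a get needs only one lookup, but every write to a given row still lands on the same region.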
+ <para>Hashing refers to applying a random one-way function to the row key, such that a
+ particular row always gets the same arbitrary value applied. This preserves the sort order
+ so that scans are effective, but spreads out load across a region. One example where hashing
+ is the right strategy would be if for some reason, a large proportion of rows started with
+ the same letter. Normally, these would all be sorted into the same region. You can apply a
+ hash to artificially differentiate them and spread them out.</para>
Hashing actually totally trashes the sort order – in fact, the goal of hashing is to disperse entries that are lexicographically near each other as evenly as possible.
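A quick way to see this, sketched in plain Python with made-up sample keys and SHA-256 standing in for whatever hash would actually be used (both are assumptions on my part): replace each row key with its hash and look at the order the rows would be stored in.

```python
import hashlib


def hashed_key(row_key):
    """Replace the row key with a hex digest of itself."""
    return hashlib.sha256(row_key.encode("utf-8")).hexdigest()


keys = ["foo0001", "foo0002", "foo0003", "foo0004"]

# Rows are stored sorted by the hashed key, not by the original key.
stored_order = sorted(keys, key=hashed_key)

for original in stored_order:
    print(hashed_key(original)[:8], original)
```

The printed order follows the digests, which have no relationship to the lexicographic order of the original keys, so a range scan over foo0001..foo0004 no longer maps to a contiguous range in the table.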