Here's a rough sketch of the algorithm, sorry if it's a bit verbose. I take into account that row locks are per-RS. To give you a quick overview, we support reading, updating, and renaming a value V in a row. Renames move value V from one row to another. All reads occur without row locks (unless a rename failure occurs), and all updates (put/delete), which only affect a single row, use HBase CAS. The only situation where I need locking is to deal with renames and rename failure recovery. I need rename to appear to be atomic from the client's perspective, but I can perform rename failure recovery so long as it appears transparent.
Rename of value V from row A to row B is accomplished with 5 steps: (1) acquire row locks on A and B in row key order (so that a deadly embrace does not occur where another client locks B and then A) (2) Update value V in row A using CAS with V~src~ which marks V as a rename source and includes the row key of B as the destination (3) Update B using CAS (using an empty value as the comparison value) with a modified copy of V, V~dst~, which marks B as the destination(and includes A's row key as source) (4) Delete V from Row A (5) Put original value V in Row B (this time not marked as a rename source or destination, just the exact original bytes). In steps 2-5, each of the CAS, Delete, and Put operations uses the appropriate row lock, so row lock A is used for any modifications to row A.
At any one of these 5 steps row locks may be lost or the machine performing the rename may die. This will leave one or both of the rename source/destination in the 'pending rename' state. Thus we must recover from any prefix of the rename operations executed by a failed rename, which occurs when we read a value that is marked as pending rename source (V~src~) or pending rename destination(V~dst~).
Initial reads occur without locks. However, reads are not allowed to return 'pending rename' values. Updates (Put/Delete) are not affected by renames, as all updates use CAS, and no 'read value' operation is permitted to return a 'pending rename' value, and thus no update will successfully modify a value marked 'pending rename' as the CAS will fail.
If a pending rename value is encountered, the read must recover from the rename failure. The recovery method acquires locks on A and B (which it can do because pending values include the row key of the 'other' row, i.e. V~src~ includes B's row key). Then the recovery process performs an undo or redo depending on if V~dst~ was written, undoing if V~dst~ was never written by writing the original value V to row A, and re-doing the rename by re-executing steps (4)-(5) of the rename otherwise. In either case, all modifications are performed with the appropriate row lock.
The critical benefit locks provide is that once the rename recovery procedure has acquired row locks on A and B, it is guaranteed that the original rename procedure (or previous failed rename recovery procedures) will never 'wake up' later and successfully perform mutations related to the failed rename (as their locks will have expired). I can provide more details about this algorithm in general if the above is unclear.
This algorithm seems to work correctly given that locks may be lost at any time. If I understand what you said correctly, the major issue is that locks are not persisted in HLog currently. However, the only risk presented by lack of lock persistence is locks may be lost before lease expiration, correct? However, it is still the case that at most one client can lock at a row – it's just that a client may lose their lock even before the lease expires due to lack of HLog persistence (and all future operations performed with the 'lost' lock will fail).
I don't see this as a huge drawback as anyone dealing with a critical section guarded by HBase row locks must handle the case that locks may be lost because locks have leases that may expire. Thus any critical section may fail at an arbitrary point and complete only a prefix of the operations in the critical section. From a correctness perspective, ensuring that programs transparently handle lock failure due to lease expiration should also address losing locks due to lack of persistence in the HLog. Please correct me if I'm missing something here concerning the HLog durability, or if the effect differs from what I've described..