HBASE-2332: Remove client-exposed row locks from region server

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Client, regionserver
    • Labels: None
    • Hadoop Flags: Incompatible change

      Description

      Discussion in HBASE-2294 has surfaced that the client-exposed row lock feature in the HBase API may not be scalable/necessary. Additionally there are some benefits we can reap by removing the feature (or pushing it into the client).

        Issue Links

          • duplicates HBASE-7315 (per the resolution comment below)

          Activity

          stack added a comment -

          Resolving as duplicate of HBASE-7315 where rowlocks are removed from the API.

          stack added a comment -

          Scratch that. Need to be able to point at an alternative before we can deprecate. Moving out of 0.92 again.

          stack added a comment -

          Let's get the deprecated flags added at least.

          stack added a comment -

          Moving out of 0.92.0. Pull it back in if you think different.

          Michael Dalton added a comment -

          Here's a rough sketch of the algorithm, sorry if it's a bit verbose. I take into account that row locks are per-RS. To give you a quick overview, we support reading, updating, and renaming a value V in a row. Renames move value V from one row to another. All reads occur without row locks (unless a rename failure occurs), and all updates (put/delete), which only affect a single row, use HBase CAS. The only situation where I need locking is to deal with renames and rename failure recovery. I need rename to appear to be atomic from the client's perspective, but I can perform rename failure recovery so long as it appears transparent.

          Rename of value V from row A to row B is accomplished in five steps:

          1. Acquire row locks on A and B in row-key order (so that a deadly embrace does not occur where another client locks B and then A).
          2. Update value V in row A using CAS with V~src~, which marks V as a rename source and includes the row key of B as the destination.
          3. Update B using CAS (using an empty value as the comparison value) with a modified copy of V, V~dst~, which marks B as the destination and includes A's row key as the source.
          4. Delete V from row A.
          5. Put the original value V in row B (this time not marked as a rename source or destination, just the exact original bytes).

          In steps 2-5, each of the CAS, Delete, and Put operations uses the appropriate row lock, so row lock A is used for any modifications to row A.

          At any of these five steps, row locks may be lost or the machine performing the rename may die, leaving one or both of the rename source/destination in the 'pending rename' state. We must therefore be able to recover from any prefix of the operations executed by a failed rename; recovery is triggered when a read encounters a value marked as a pending rename source (V~src~) or a pending rename destination (V~dst~).
          Initial reads occur without locks, but reads are never allowed to return 'pending rename' values. Updates (Put/Delete) are unaffected by renames: every update uses CAS against a value obtained from a read, and since no read is permitted to return a 'pending rename' value, no update will successfully modify a value marked 'pending rename'; the CAS will fail.

          If a pending rename value is encountered, the read must recover from the rename failure. The recovery method acquires locks on A and B (which it can do because pending values include the row key of the 'other' row, i.e. V~src~ includes B's row key). The recovery process then performs an undo or a redo depending on whether V~dst~ was written: if V~dst~ was never written, it undoes the rename by writing the original value V back to row A; otherwise it completes the rename by re-executing steps (4)-(5). In either case, all modifications are performed with the appropriate row lock.
          The critical benefit locks provide is that once the rename recovery procedure has acquired row locks on A and B, it is guaranteed that the original rename procedure (or previous failed rename recovery procedures) will never 'wake up' later and successfully perform mutations related to the failed rename (as their locks will have expired). I can provide more details about this algorithm in general if the above is unclear.
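
          For concreteness, here is a rough Java sketch of the five-step rename above, written against the row-lock client API of that era (lockRow/unlockRow, the lock-taking Get/Put/Delete constructors, and checkAndPut). The encodeAsSource/encodeAsDest helpers, the column coordinates, and the error handling are placeholders for illustration, not a tested implementation.

          import java.io.IOException;
          import org.apache.hadoop.hbase.HConstants;
          import org.apache.hadoop.hbase.client.Delete;
          import org.apache.hadoop.hbase.client.Get;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.Put;
          import org.apache.hadoop.hbase.client.RowLock;
          import org.apache.hadoop.hbase.util.Bytes;

          /** Hypothetical sketch of the five-step rename described above. */
          public class RenameSketch {

            /** Moves the value in (family, qual) from rowA to rowB. */
            static void rename(HTable table, byte[] rowA, byte[] rowB,
                               byte[] family, byte[] qual) throws IOException {
              // (1) Lock both rows in row-key order to avoid a deadly embrace.
              byte[] first = Bytes.compareTo(rowA, rowB) < 0 ? rowA : rowB;
              byte[] second = (first == rowA) ? rowB : rowA;
              RowLock firstLock = table.lockRow(first);
              RowLock secondLock = table.lockRow(second);
              RowLock lockA = (first == rowA) ? firstLock : secondLock;
              RowLock lockB = (first == rowA) ? secondLock : firstLock;
              try {
                byte[] v = table.get(new Get(rowA, lockA)).getValue(family, qual);
                if (v == null) {
                  return;  // nothing to rename
                }

                // (2) Mark V in row A as a rename source pointing at row B (V~src~).
                Put markSrc = new Put(rowA, lockA);
                markSrc.add(family, qual, encodeAsSource(v, rowB));  // placeholder encoding
                if (!table.checkAndPut(rowA, family, qual, v, markSrc)) {
                  return;  // A changed underneath us; caller can retry
                }

                // (3) Write V~dst~ into row B; a null expected value means "cell must not exist".
                Put markDst = new Put(rowB, lockB);
                markDst.add(family, qual, encodeAsDest(v, rowA));    // placeholder encoding
                if (!table.checkAndPut(rowB, family, qual, null, markDst)) {
                  Put undo = new Put(rowA, lockA);                   // B is occupied: undo step (2)
                  undo.add(family, qual, v);
                  table.put(undo);
                  return;
                }

                // (4) Delete V~src~ from row A.
                Delete del = new Delete(rowA, HConstants.LATEST_TIMESTAMP, lockA);
                del.deleteColumns(family, qual);
                table.delete(del);

                // (5) Replace V~dst~ in row B with the exact original bytes of V.
                Put finish = new Put(rowB, lockB);
                finish.add(family, qual, v);
                table.put(finish);
              } finally {
                table.unlockRow(secondLock);
                table.unlockRow(firstLock);
              }
            }

            // The real encodings must embed the peer row key so recovery can find the other row.
            private static byte[] encodeAsSource(byte[] v, byte[] destRow) { return v; }
            private static byte[] encodeAsDest(byte[] v, byte[] srcRow) { return v; }
          }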

          This algorithm seems to work correctly given that locks may be lost at any time. If I understand what you said correctly, the major issue is that locks are not currently persisted in the HLog. The only risk presented by the lack of lock persistence, though, is that locks may be lost before lease expiration, correct? It is still the case that at most one client can lock a row; it's just that a client may lose its lock even before the lease expires due to the lack of HLog persistence (and all future operations performed with the 'lost' lock will fail).

          I don't see this as a huge drawback: anyone using a critical section guarded by HBase row locks must already handle the case that locks may be lost, because locks have leases that may expire. Thus any critical section may fail at an arbitrary point and complete only a prefix of its operations. From a correctness perspective, ensuring that programs transparently handle lock failure due to lease expiration should also cover losing locks due to the lack of HLog persistence. Please correct me if I'm missing something here concerning HLog durability, or if the effect differs from what I've described.

          Todd Lipcon added a comment -

          So if a client holds a lock, blocks for some reason (swapping, etc) that is taking an indeterminate amount of time, and then loses its lock lease, the next lock holder is guaranteed that all future actions taken by the previous holder using the expired lock will fail

          Only on that particular row, though, since the leases are per-regionserver.

          Is there an objection to just making row lock acquisition non-blocking and implementing all the blocking/retry/backoff client-side?

          I definitely agree that this solves much of the scalability issue. Some of the other issues above still remain, though - for example, we don't currently persist lock acquisition/release in the HLog, so if the region moves to a new server, everyone loses their locks.

          Regarding the use case of moving rows, what does your lock based algorithm look like? Will it proceed correctly in the case of RS failure, loss of lease at arbitrary points, and failure of client?

          Michael Dalton added a comment -

          One issue with implementing row locks client-side with CAS is that row locks have leases. So if a client holds a lock, blocks for some reason (swapping, etc) that is taking an indeterminate amount of time, and then loses its lock lease, the next lock holder is guaranteed that all future actions taken by the previous holder using the expired lock will fail. This is very useful when performing rename-type operations (e.g., moving a value from row A to row B) where you want the deletion of the source and the creation of the destination value to appear atomic (and you must recover from mid-rename failures). I currently use row locks for a few rename-style operations, and rename-failure-recovery.

          This could be emulated in CAS by requiring each row to hold a generation count or some equivalent, and then incrementing a generation count using CAS every time the row is mutated as a form of optimistic concurrency control. However, this would require each row to maintain persistent generation counts even for deleted rows in the general case, causing deleted rows to consume storage space.

          Is there an objection to just making row lock acquisition non-blocking and implementing all the blocking/retry/backoff client-side? This should be backwards compatible with existing APIs. I've started work on this at HBASE-2584. If there is interest I would expand further on the patch as needed. Otherwise, advisory locks will solve the generation count issue I outlined above, at least for all of my needs, as long as they use leases and ensure that all operations performed using a lease-expired lock automatically fail.
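
          A minimal sketch of the generation-count emulation described above, using checkAndPut for optimistic concurrency; the "d:gen" counter column, the "d:value" column, and the retry loop are assumptions for illustration only.

          import java.io.IOException;
          import org.apache.hadoop.hbase.client.Get;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.Put;
          import org.apache.hadoop.hbase.client.Result;
          import org.apache.hadoop.hbase.util.Bytes;

          /** Optimistic row update guarded by a per-row generation counter. */
          public class GenerationCasSketch {
            private static final byte[] FAM = Bytes.toBytes("d");
            private static final byte[] VALUE = Bytes.toBytes("value");
            private static final byte[] GEN = Bytes.toBytes("gen");   // hypothetical counter column

            /** Retries until the read-modify-write commits against an unchanged generation. */
            static void update(HTable table, byte[] row, byte[] newValue) throws IOException {
              while (true) {
                Result r = table.get(new Get(row));
                byte[] oldGen = r.getValue(FAM, GEN);                 // null if the row is new
                long nextGen = (oldGen == null) ? 1L : Bytes.toLong(oldGen) + 1;

                Put p = new Put(row);
                p.add(FAM, VALUE, newValue);
                p.add(FAM, GEN, Bytes.toBytes(nextGen));

                // CAS on the generation cell: only commit if nobody bumped it since our read.
                if (table.checkAndPut(row, FAM, GEN, oldGen, p)) {
                  return;
                }
                // Lost the race; re-read and try again (backoff omitted for brevity).
              }
            }
          }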

          stack added a comment -

          Moved from 0.21 to 0.22 just after merge of old 0.20 branch into TRUNK.

          Jonathan Gray added a comment -

          +1. I like the idea of retaining some type of client-exposed lock, but making them so that they only take effect in conjunction with other client locks of the same type.

          Todd Lipcon added a comment -

          So it would be strictly between the clients and a coordinator like zookeeper, not involving the RS side at all.

          I think my response to that comment still applies - the locking has to be located at the storage layer to be useful (otherwise you don't know if you will still have the ZK lock by the time your mutation hits the RS).

          One thought is that we would provide the locking as an optional flag on single-row mutations - i.e., I can advisory-lock a row, and then do a get/put on that row with a flag that says "only succeed this put if I still own the lock on the row." So the use case of "lock, get, put, unlock" is still possible, assuming all users of the row are using this API. We can just skip providing this API for bulk puts, due to the issues raised in this comment.
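
          A purely hypothetical shape for that flag, just to make the idea concrete; none of these methods exist in the HBase client, and all of the names are invented.

          // Hypothetical client-side API sketch only; not part of any HBase release.
          public interface AdvisoryLockedTable {
            /** Acquire an advisory lock on the row; other clients' mutations are unaffected. */
            LockToken advisoryLock(byte[] row) throws java.io.IOException;

            /** Apply the put only if the given advisory lock is still held by this client. */
            boolean putIfLockHeld(org.apache.hadoop.hbase.client.Put put, LockToken token)
                throws java.io.IOException;

            /** Release the advisory lock (a no-op if the lease already expired). */
            void advisoryUnlock(LockToken token) throws java.io.IOException;

            /** Opaque handle returned by advisoryLock; carries a lease the server can expire. */
            interface LockToken {}
          }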

          Jean-Daniel Cryans added a comment -

          they wouldn't block other clients from doing mutations, unless the other clients were also taking locks on these rows.

          That's what I had in mind when I commented on HBASE-2294:

          <crazyidea>If we remove row locks from core HBase, maybe we should still support them at the HTable level but by using ZK directly instead of contacting region servers.</crazyidea>

          So it would be strictly between the clients and a coordinator like zookeeper, not involving the RS side at all.
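
          A minimal sketch of what such a client-only lock could look like with a plain ZooKeeper ephemeral znode per row; the znode layout, and the assumption that row keys are safe to embed in znode names, are illustrative only.

          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          /** Advisory row lock held as an ephemeral znode; no region server involvement. */
          public class ZkRowLockSketch {
            private final ZooKeeper zk;
            private final String basePath;  // e.g. "/hbase-client-locks/<table>" (assumed layout)

            public ZkRowLockSketch(ZooKeeper zk, String basePath) {
              this.zk = zk;
              this.basePath = basePath;
            }

            /** Try to take the lock; the znode disappears automatically if this client dies. */
            public boolean tryLock(String rowKey) throws KeeperException, InterruptedException {
              try {
                zk.create(basePath + "/" + rowKey, new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return true;
              } catch (KeeperException.NodeExistsException e) {
                return false;  // someone else holds the advisory lock
              }
            }

            public void unlock(String rowKey) throws KeeperException, InterruptedException {
              zk.delete(basePath + "/" + rowKey, -1);  // -1: ignore znode version
            }
          }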

          Todd Lipcon added a comment -

          Another idea here suggested by a coworker of mine is to make row locks advisory, similar to flock(2). These locks would be entirely decoupled from the internal row locks used by HBase - that is to say, they wouldn't block other clients from doing mutations, unless the other clients were also taking locks on these rows.

          We could even offer shared/exclusive locks if we wanted to get fancy here.

          Andrew Purtell added a comment -

          Emulation of the current row lock API with CAS is fine, but we do need something. For example, today I needed an incrementColumnValue op that always adjusts the timestamp. ICV won't modify timestamps while the value is in the memstore, so I just used a get and put within a rowlock instead.
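
          Roughly, that workaround against the row-lock client API of the time (lockRow/unlockRow and the lock-taking Get/Put constructors); the counter column coordinates are placeholders.

          import java.io.IOException;
          import org.apache.hadoop.hbase.client.Get;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.Put;
          import org.apache.hadoop.hbase.client.RowLock;
          import org.apache.hadoop.hbase.util.Bytes;

          /** Increment a counter and get a fresh timestamp by doing get+put under a row lock. */
          public class LockedIncrementSketch {
            static void incrementWithNewTimestamp(HTable table, byte[] row,
                                                  byte[] fam, byte[] qual) throws IOException {
              RowLock lock = table.lockRow(row);           // lease-backed, client-exposed lock
              try {
                byte[] cur = table.get(new Get(row, lock)).getValue(fam, qual);
                long next = (cur == null) ? 1L : Bytes.toLong(cur) + 1;
                Put p = new Put(row, lock);
                p.add(fam, qual, Bytes.toBytes(next));     // written at the current timestamp
                table.put(p);
              } finally {
                table.unlockRow(lock);
              }
            }
          }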

          Todd Lipcon added a comment -

          To summarize some of the thoughts from the other JIRA (plus some conversation Ryan and I have been having):

          • Row locks provided by HBase are not very scalable, since clients blocked on locks tie up valuable regionserver threads.
          • Row locks may not be correct in the current implementation, since they aren't logged in the HLog and thus don't survive region failover
          • We provide a compare-and-swap primitive, which is sufficient to achieve the same effect as row locks from the client side (using a spin lock pattern)

          We'll still need row locking internally to achieve atomic operations (see HBASE-2294 for more details). However, these should be a private implementation detail.

          Some benefits we can get by dropping the external exposure of the row locks:

          • The guarantees determined in HBASE-2294 may be achievable through more efficient means - eg lock free data structures, MVCC, etc.
          • Externally exposed row locks add complexity both to implementation and APIs (eg being able to specify an existing row lock id to other mutations)
          • Though we still need internal row locks, we can most likely get away with multiplexing multiple rows to a single lock. For example, we can have a few hundred locks in an array, then hash rows onto that array. Since internal locks are only held very briefly (enough time to mutate the memstore and sync), and only one row at a time, this should be safe and more memory-efficient (a rough sketch follows below).
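
          A minimal illustration of that multiplexing idea: a fixed array of JDK locks with rows hashed onto it. The stripe count is an arbitrary assumption, and correctness relies on a thread holding at most one row (hence one stripe) at a time, as noted in the bullet above.

          import java.util.Arrays;
          import java.util.concurrent.locks.ReentrantLock;

          /** Internal-only lock striping: many rows share one of a few hundred locks. */
          public class RowLockStripes {
            private final ReentrantLock[] stripes;

            public RowLockStripes(int numStripes) {        // e.g. a few hundred, as suggested above
              stripes = new ReentrantLock[numStripes];
              for (int i = 0; i < numStripes; i++) {
                stripes[i] = new ReentrantLock();
              }
            }

            /** Hash the row key onto a stripe; since only one row is locked at a time,
             *  collisions only add contention, never deadlock. */
            public ReentrantLock lockFor(byte[] row) {
              int idx = (Arrays.hashCode(row) & 0x7fffffff) % stripes.length;
              return stripes[idx];
            }
          }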

          If the row lock API is deemed important for users, we can implement it client-side using the CAS pattern. To make it more efficient, we could add a signaling mechanism - more details on this in the future if it comes up.
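
          And a rough sketch of that client-side CAS "spin lock" pattern, written against the checkAndPut/checkAndDelete calls of later client APIs; the "_lock" column and the owner token are assumptions, and note that this emulation carries no lease, so a crashed client leaves the lock cell behind until someone cleans it up.

          import java.io.IOException;
          import org.apache.hadoop.hbase.client.Delete;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.Put;
          import org.apache.hadoop.hbase.util.Bytes;

          /** Client-side row "lock" emulated with CAS; no server-side lock or lease involved. */
          public class CasSpinLockSketch {
            private static final byte[] FAM = Bytes.toBytes("d");
            private static final byte[] LOCK = Bytes.toBytes("_lock");   // hypothetical lock column

            /** Spin until the lock cell can be created; owner identifies this client. */
            static void lock(HTable table, byte[] row, byte[] owner)
                throws IOException, InterruptedException {
              Put p = new Put(row);
              p.add(FAM, LOCK, owner);
              // checkAndPut with a null expected value means "only if the cell does not exist".
              while (!table.checkAndPut(row, FAM, LOCK, null, p)) {
                Thread.sleep(50);   // crude backoff; a real client would cap and jitter this
              }
            }

            /** Release only if we still own the lock cell. */
            static void unlock(HTable table, byte[] row, byte[] owner) throws IOException {
              Delete d = new Delete(row);
              d.deleteColumns(FAM, LOCK);
              table.checkAndDelete(row, FAM, LOCK, owner, d);
            }
          }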


            People

            • Assignee: Unassigned
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 6
