HBase / HBASE-3789

Cleanup the locking contention in the master

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.90.2
    • Fix Version/s: 0.92.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      The master now creates the znode when closing a region, which is an incompatible change.
      SplitTransaction now waits on the master to delete the znode that it created before it can finish.
      The master doesn't keep track of znodes being deleted and created anymore, it was getting out of sync too easily.

      Description

      The new master uses a lot of synchronized blocks to be safe, but it only takes a few jstacks to see that there are multiple layers of lock contention when a bunch of regions are moving (like when the balancer runs). The main culprits are regionInTransition in AssignmentManager, ZKAssign's use of ZKW.getNodes (basically another set of regions in transition), and locking at the RegionState level.

      My understanding is that even though we have multiple threads to handle regions in transition, everything is actually serialized. Most of the time, lock holders are talking to ZK or a region server, which can take a few milliseconds.

      A simple example is when AssignmentManager wants to update the timers for all the regions on a RS, it will usually be waiting on another thread that's holding the lock while talking to ZK.
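The contention pattern described above can be sketched in a few lines. This is an illustrative standalone example with hypothetical names, not actual HBase code: one thread holds a coarse lock across a slow remote call (standing in for a ZK round trip), so a cheap timer update on another thread has to wait behind it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

// Sketch of the contention: a coarse lock serializes a cheap timer
// update behind a slow "remote" call. Names are hypothetical.
public class CoarseLockDemo {
    private final Map<String, Long> timers = new HashMap<>();
    final CountDownLatch lockHeld = new CountDownLatch(1);

    // Stands in for handleRegion(): holds the lock across a slow call.
    public synchronized void transitionRegion(String region) throws InterruptedException {
        lockHeld.countDown();       // signal that the lock is now held
        Thread.sleep(200);          // stand-in for a ZooKeeper round trip
        timers.put(region, System.nanoTime());
    }

    // Stands in for updateTimers(): cheap, but must wait on the same lock.
    public synchronized void updateTimer(String region) {
        timers.put(region, System.nanoTime());
    }
}
```

With many regions in transition, every such cheap update queues behind whichever thread is currently talking to ZK, which is the serialization the jstacks show.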

      1. HBASE-3789-trunk.patch
        36 kB
        Jean-Daniel Cryans
      2. HBASE-3789-v4-0.90.patch
        27 kB
        Jean-Daniel Cryans

        Issue Links

          Activity

          Jean-Daniel Cryans added a comment -

          This first patch is a dirty work in progress.

          First thing I changed is in ZKW I used a ConcurrentSkipListSet instead of a HashSet which resulted in removing all the weird locking in ZKAssign.

          Next up is AssignmentManager where I removed all the sync done in nodeCreated/DataChanged/ChildrenChanged since it was already handled inside handleRegion(). Also it is doing ZK operations under that sync so it's double bad.

          Third I changed RegionState so that the stamp is atomically modifiable since it doesn't really matter that both the state and the stamp be changed by exactly the same person, what you want in the end is progress. This was also the source of a lot of locking in updateTimers.

          I tested this patch under load while killing region servers (multiple at a time), and then running the balancer. Didn't see a single bug. Still needs more testing, and I need to document the locking and see if there's any other optimization that could be done.

          At least, my profiling shows that the master is now only waiting on ZK.
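The first and third changes in the comment above amount to swapping externally synchronized structures for concurrent ones. A rough sketch (hypothetical class and field names, not the actual HBase code):

```java
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the two data-structure changes: a concurrent set replaces a
// HashSet guarded by synchronized blocks, and the RegionState stamp
// becomes an atomic counter so it can be bumped without a lock.
public class LockFreeSketch {
    // Safe for concurrent add/remove/contains with no external locking.
    private final ConcurrentSkipListSet<String> nodes = new ConcurrentSkipListSet<>();

    // The stamp only needs to make progress, not stay paired with state
    // under one lock, so an atomic increment is enough.
    private final AtomicLong stamp = new AtomicLong();

    public boolean addNode(String znode)    { return nodes.add(znode); }
    public boolean removeNode(String znode) { return nodes.remove(znode); }
    public long updateStamp()               { return stamp.incrementAndGet(); }
}
```

The design choice is the one stated in the comment: callers don't need the state and stamp to change together, only for the stamp to advance, which is exactly what an atomic counter gives without any contention.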

          Jean-Daniel Cryans added a comment -

          I might also add that with this patch, where it used to take 25 seconds to run the balancer command, it now returns in under 1 second.

          Ted Yu added a comment -

          Good job J-D.
          Can you disclose roughly how many regions were moved during the 1-second balancing?

          Jean-Daniel Cryans added a comment -

          The balancer only sends the unassign messages before returning. About 500. The total time to balance is halved.

          stack added a comment -

          You should remove rather than comment out code.

          There is more to be done on this still, right J-D?

          Jean-Daniel Cryans added a comment -

          Yes, like I wrote in the first comment it's a dirty WIP.

          Jean-Daniel Cryans added a comment -

          There's one major issue with my current patch: a race between the master's OpenedRegionHandler and the events thread. It goes like this:

          • RS transitions a region to OPENING
          • RS transitions again to OPENING
          • Master receives the first event, reads ZK and sees OPENING
          • RS transitions to OPENED
          • Master receives the second event, reads ZK and sees OPENED instead of OPENING, kicks off the OpenedRegionHandler
          • The handler will at some point delete the znode in the ZKW.getNodes structure (such a bad method name) before deleting the actual znode
          • Master receives the third event, reads ZK, sees OPENED but finds that getNodes doesn't contain the znode and considers this as a new region in transition so it adds it back in getNodes()
          • The handler deletes the znode
          • The Master does a no-op when it figures it cannot transition from OPEN to OPENED

          At this point the region is assigned and everything is "fine"... until the master decides for any reason to unassign the region. It sends the unassignment, receives an event but doesn't process it in nodeChildrenChanged because ZKW.getNodes() already has it. From that point the master will spin in "Region has been PENDING_CLOSE for too long" until it's put out of its misery.

          The issue here is that the region server creates the unassigned znode by itself, unlike an assignment, where it's the master that does it. Doing that in the master won't fully solve the issue though, because in 0.92 the RS still creates znodes for splits, and there's no way that could be done by the master as it would basically be like going back to how it used to work.

          So this is what Stack and I thought about:

          • The master needs to create the unassigned znode before telling a RS to close a region, the RS will now just update it
          • ZKW needs to stop keeping track of the znodes, getting into a situation where we have a mismatch is too easy
          • The SplitTransaction will still create the znode, but it will then wait at the very end until it gets deleted by the master. To make sure the master sees the change, it will tickle the znode like we do for OPENING so that the master doesn't miss it
          • The method AssignmentManager.nodeChildrenChanged will only put watchers on znodes and won't keep track of anything
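The third bullet above, SplitTransaction waiting for the master to delete the znode it created, can be sketched as a poll-and-sleep loop. This is a hedged standalone illustration, not the actual patch: the ZnodeReader interface is a stand-in for a real ZooKeeper client, and the path and timeout are made up for the example.

```java
import java.util.concurrent.TimeUnit;

// Sketch of the SplitTransaction change: after the split completes, the
// RS polls until the master deletes the znode. ZnodeReader is a
// hypothetical stand-in for a ZooKeeper client.
public class SplitZnodeWait {
    public interface ZnodeReader {
        boolean exists(String path) throws Exception;
    }

    // Poll with a short sleep until the znode is gone or we time out.
    public static boolean waitForMasterDelete(ZnodeReader zk, String path,
                                              long timeoutMs) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (!zk.exists(path)) {
                return true; // master deleted it, the split can finish
            }
            TimeUnit.MILLISECONDS.sleep(100);
        }
        return false; // timed out; caller has to decide what to do
    }
}
```

A watch-based wait on the znode would avoid the polling, but as a later comment notes, the sleep was the easiest way to get this working.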
          Jean-Daniel Cryans added a comment -

          New cleaned-up patch that does what I listed in the previous comment, except the splitting part, since this is currently a patch for 0.90.

          My testing shows that creating the znodes when closing in the master is slower since, duh, it's not done in parallel by the region servers. The patch is still much faster than the current master, and people tweaking the number of handlers higher should see good speedups. I haven't seen any bugs with this patch.

          Up next is more testing and porting to trunk with the handling of splits.

          Jean-Daniel Cryans added a comment -

          With the previous patch all the tests passed except for hbck. Looking deeper, I see hbck creates its own znodes, so the master doesn't see them. It's not clear to me why it's not using HBA.assign instead of the trickery with the HBCK_CODE_NAME.

          This patch modifies hbck so that it uses "normal" tools provided by the master instead of bypassing it.

          I'm also working on porting that to trunk. I got the previous patch I posted working but didn't do the hbck stuff yet because it's different.

          Also I still didn't touch the splitting code in trunk.

          Jean-Daniel Cryans added a comment -

          Patch without the conf/ stuff in it (duh).

          Jean-Daniel Cryans added a comment -

          I'm posting the equivalent to the v3 patch for trunk. I still need to handle the splitting.

          Jean-Daniel Cryans added a comment -

          Patch for trunk with the split fixes. I had to remove a test because it wasn't an issue anymore (the master now creates the znode when closing a region), then I had to do a bunch of fixes in AssignmentManager for cases where we report regions that are already split or have skipped steps, and finally I added the part of the code that waits for the master to delete the znode.

          One thing I might do further cleanup on is the latter part of SplitTransaction, which has a few methods that all look the same. Also I'm not thrilled about having to do a sleep to wait on the master, but that was the easiest way.

          stack added a comment -

          I love this patch. +1

          Jean-Daniel Cryans added a comment -

          Thanks Stack for the +1.

          Also it's important to note that the 0.90 patch breaks rolling restarts because of the way it changes the closing sequence (which is why this is targeted for 0.92). Apply at your own risk.

          Jean-Daniel Cryans added a comment -

          Committed to trunk, thanks for the review Stack.

          Jean-Daniel Cryans added a comment -

          Attaching the 0.90 patch refreshed for that branch. It's not meant for inclusion, leaving it here if it's of use to anyone. Remember it breaks rolling restarts if you plan on deploying it.

          Hudson added a comment -

          Integrated in HBase-TRUNK #1976 (See https://builds.apache.org/job/HBase-TRUNK/1976/)


            People

            • Assignee:
              Jean-Daniel Cryans
              Reporter:
              Jean-Daniel Cryans
            • Votes:
              1
              Watchers:
              5
