HBase
  1. HBase
  2. HBASE-3621

The timeout handler in AssignmentManager does an RPC while holding lock on RIT; a big no-no

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.2
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      J-D found this debugging a failure on Dmitriy's cluster; we're RPC'ing under a synchronized(regionsInTransition). Fix.

      1. 3621-v2.txt
        4 kB
        stack
      2. hbase-3621.txt
        2 kB
        Ted Yu

        Activity

        Hide
        Jean-Daniel Cryans added a comment -

        For example:

        "somenode.prod.twitter.com:60000.timeoutMonitor" daemon prio=10 tid=0x00002aacb8567800 nid=0x772 in Object.wait() [0x0000000045bf1000]
           java.lang.Thread.State: WAITING (on object monitor)
          at java.lang.Object.wait(Native Method)
          at java.lang.Object.wait(Object.java:485)
          at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
          - locked <0x00002aaab2a10da8> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
          at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
          at $Proxy6.closeRegion(Unknown Source)
          at org.apache.hadoop.hbase.master.ServerManager.sendRegionClose(ServerManager.java:589)
          at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1093)
          at org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor.chore(AssignmentManager.java:1672)
          - locked <0x00002aaabf759858> (a java.util.concurrent.ConcurrentSkipListMap)
          at org.apache.hadoop.hbase.Chore.run(Chore.java:66
        ...
        
        "main-EventThread" daemon prio=10 tid=0x00002aacb850b000 nid=0x761 waiting for monitor entry [0x00000000455eb000]
           java.lang.Thread.State: BLOCKED (on object monitor)
          at org.apache.hadoop.hbase.master.AssignmentManager.nodeDataChanged(AssignmentManager.java:525)
          - waiting to lock <0x00002aaabf759858> (a java.util.concurrent.ConcurrentSkipListMap)
          at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:268)
          at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
          at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
        

        The ZK event thread is blocked by that other thread that talks to a RS that doesn't answer. All ZK events get severely delayed.

        Show
        Jean-Daniel Cryans added a comment - For example: "somenode.prod.twitter.com:60000.timeoutMonitor" daemon prio=10 tid=0x00002aacb8567800 nid=0x772 in Object .wait() [0x0000000045bf1000] java.lang. Thread .State: WAITING (on object monitor) at java.lang. Object .wait(Native Method) at java.lang. Object .wait( Object .java:485) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757) - locked <0x00002aaab2a10da8> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call) at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) at $Proxy6.closeRegion(Unknown Source) at org.apache.hadoop.hbase.master.ServerManager.sendRegionClose(ServerManager.java:589) at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1093) at org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor.chore(AssignmentManager.java:1672) - locked <0x00002aaabf759858> (a java.util.concurrent.ConcurrentSkipListMap) at org.apache.hadoop.hbase.Chore.run(Chore.java:66 ... "main-EventThread" daemon prio=10 tid=0x00002aacb850b000 nid=0x761 waiting for monitor entry [0x00000000455eb000] java.lang. Thread .State: BLOCKED (on object monitor) at org.apache.hadoop.hbase.master.AssignmentManager.nodeDataChanged(AssignmentManager.java:525) - waiting to lock <0x00002aaabf759858> (a java.util.concurrent.ConcurrentSkipListMap) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:268) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) The ZK event thread is blocked by that other thread that talks to a RS that doesn't answer. All ZK events get severely delayed.
        Hide
        Ted Yu added a comment -

        unassign() has the following block which is before making RPC call:

            synchronized (regionsInTransition) {
            }
        

        so we don't need to enclose unassign() in the synchronized block of chore().

        Show
        Ted Yu added a comment - unassign() has the following block which is before making RPC call: synchronized (regionsInTransition) { } so we don't need to enclose unassign() in the synchronized block of chore().
        Hide
        Ted Yu added a comment -

        Initial attempt.
        I moved the processing of regions in PENDING_CLOSE state outside of synchronized block.

        Show
        Ted Yu added a comment - Initial attempt. I moved the processing of regions in PENDING_CLOSE state outside of synchronized block.
        Hide
        Ted Yu added a comment -

        assign() calls the following method:

                serverManager.sendRegionOpen(plan.getDestination(), state.getRegion());
        

        I think we should move call to assign() outside of

        synchronized (regionsInTransition) {
        }
        

        block.

        I noticed the call assign(regionState, false, true) at line 1790 doesn't obtain lock on regionState, inconsistent with the other calls to:

          private void assign(final RegionState state, final boolean setOfflineInZK,
              final boolean forceNewPlan) {
        
        Show
        Ted Yu added a comment - assign() calls the following method: serverManager.sendRegionOpen(plan.getDestination(), state.getRegion()); I think we should move call to assign() outside of synchronized (regionsInTransition) { } block. I noticed the call assign(regionState, false, true) at line 1790 doesn't obtain lock on regionState, inconsistent with the other calls to: private void assign( final RegionState state, final boolean setOfflineInZK, final boolean forceNewPlan) {
        Hide
        stack added a comment -

        This is what I committed. Its Teds patch plus I do same thing for assign (Teds patch covered unassigns happening under locks – I noticed that we are also assigning under locks so I copied Teds pattern moving assign out to run outside of the lock on regionsInTransition.

        Show
        stack added a comment - This is what I committed. Its Teds patch plus I do same thing for assign (Teds patch covered unassigns happening under locks – I noticed that we are also assigning under locks so I copied Teds pattern moving assign out to run outside of the lock on regionsInTransition.
        Hide
        stack added a comment -

        Committed to branch and trunk. Thanks for the patch Ted.

        Show
        stack added a comment - Committed to branch and trunk. Thanks for the patch Ted.
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK #1803 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1803/)

        Show
        Hudson added a comment - Integrated in HBase-TRUNK #1803 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1803/ )

          People

          • Assignee:
            Ted Yu
            Reporter:
            stack
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development