HBase / HBASE-3147

Regions stuck in transition after rolling restart, perpetual timeout handling but nothing happens

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: None
    • Labels:
      None

      Description

      The rolling restart script is great for bringing on the weird stuff. On my small, loaded cluster, running it horks the cluster and the cluster doesn't recover. I noticed two issues that need fixing:

      1. We miss noticing that a server was carrying .META., so .META. never gets assigned: the shutdown handlers wait forever on a .META. assignment that will never happen.
      2. Perpetual cycling of the following sequence for each region that was not successfully assigned:

       2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b. state=PENDING_OPEN, ts=1287869814294
       2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN or OPENING for too long, reassigning region=usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b.
       2010-10-23 21:37:57,404 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2bd57d1475046a Attempting to transition node 7f2d92497d2d03917afd574ea2aca55b from RS_ZK_REGION_OPENING to M_ZK_REGION_OFFLINE
       2010-10-23 21:37:57,404 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2bd57d1475046a Attempt to transition the unassigned node for 7f2d92497d2d03917afd574ea2aca55b from RS_ZK_REGION_OPENING to M_ZK_REGION_OFFLINE failed, the node existed but was in the state M_ZK_REGION_OFFLINE
       2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region transitioned OPENING to OFFLINE so skipping timeout, region=usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b.
       ...

      The timeout period elapses again, and then the same sequence repeats.
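      To make the cycle in the log concrete, here is a toy Java model. It is not HBase code and not the contents of the attached patches; the class, method, and parameter names are hypothetical, and only the state names are borrowed from the log. The unassigned znode is modeled as an AtomicReference, and a transition succeeds only if the node is currently in the expected state (the conditional check the ZKAssign WARN line reports failing). The master still believes the region is OPENING, but the znode is already M_ZK_REGION_OFFLINE, so every OPENING to OFFLINE attempt fails, the failure is read as "region moved on", and the pass is skipped. The treatOfflineAsReassignable switch is a hypothetical alternative included only for contrast.

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the timeout livelock in the log above (hypothetical, not HBase code).
class TimeoutLoopSketch {
    // State names borrowed from the log; this enum itself is hypothetical.
    enum ZNodeState { M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENING, RS_ZK_REGION_OPENED }

    // Conditional transition: succeeds only if the node is currently in `expected`.
    static boolean transition(AtomicReference<ZNodeState> node,
                              ZNodeState expected, ZNodeState target) {
        return node.compareAndSet(expected, target);
    }

    // Runs `rounds` timeout passes over one region the master thinks is OPENING,
    // whose znode is already OFFLINE. Returns how many times it was reassigned.
    // treatOfflineAsReassignable == false mimics the logged behavior;
    // true is a hypothetical variant that reassigns when the node is already OFFLINE.
    static int simulate(boolean treatOfflineAsReassignable, int rounds) {
        AtomicReference<ZNodeState> node =
            new AtomicReference<>(ZNodeState.M_ZK_REGION_OFFLINE);
        int reassignments = 0;
        for (int i = 0; i < rounds; i++) {
            boolean moved = transition(node,
                ZNodeState.RS_ZK_REGION_OPENING, ZNodeState.M_ZK_REGION_OFFLINE);
            boolean alreadyOffline = node.get() == ZNodeState.M_ZK_REGION_OFFLINE;
            if (moved || (treatOfflineAsReassignable && alreadyOffline)) {
                reassignments++;
                // Pretend the reassigned region opens successfully, ending the cycle.
                node.set(ZNodeState.RS_ZK_REGION_OPENED);
                break;
            }
            // Logged behavior: "skipping timeout". Nothing changes, so the next
            // pass fails the same transition in exactly the same way.
        }
        return reassignments;
    }

    public static void main(String[] args) {
        System.out.println("as logged:    reassignments = " + simulate(false, 5));
        System.out.println("hypothetical: reassignments = " + simulate(true, 5));
    }
}
```

      With the logged behavior, simulate(false, n) returns 0 for any n: each pass fails the same conditional transition and skips, which is the perpetual cycle described above. The hypothetical variant reassigns on the first pass.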

      This is what I've been working on.

      1. HBASE-3147-v11.patch (30 kB, stack)
      2. HBASE-3147-v6.patch (26 kB, stack)

        Activity

        stack created issue
        stack made changes:
          Field         Original Value   New Value
          Attachment                     HBASE-3147-v6.patch [ 12458024 ]
        stack made changes:
          Attachment                     HBASE-3147-v11.patch [ 12458083 ]
        stack made changes:
          Status        Open [ 1 ]       Resolved [ 5 ]
          Resolution                     Fixed [ 1 ]
        stack made changes:
          Assignee                       stack [ stack ]

          People

          • Assignee: stack
          • Reporter: stack
          • Votes: 0
          • Watchers: 0
