Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.92.1
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      I found this issue through a znode that has existed for a long time under the unassigned parent, while HMaster keeps logging a growing number of warnings. The repeating log is below.

      WARN org.apache.hadoop.hbase.master.AssignmentManager: Region 1a1c950ad45812d7b4b9b90ebf268468 not found on server sev0040,60020,1350378314041; failed processing
      WARN org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for region 1a1c950ad45812d7b4b9b90ebf268468 from server sev0040,60020,1350378314041 but it doesn't exist anymore, probably already processed its split
      WARN org.apache.hadoop.hbase.master.AssignmentManager: Region 1a1c950ad45812d7b4b9b90ebf268468 not found on server gs-dpo-sev0040,60020,1350378314041; failed processing
      WARN org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for region 1a1c950ad45812d7b4b9b90ebf268468 from server sev0040,60020,1350378314041 but it doesn't exist anymore, probably already processed its split

      We use HBase 0.92.1, and I traced this back to the source code. The HMaster AssignmentManager has already deleted the SPLIT region from its in-memory structure, but the HRegionServer SplitTransaction still finds the unassigned/parent znode in a transient state. More precisely, SplitTransaction executes tickleNodeSplit to write a new version slightly later than AssignmentManager deletes the unassigned/parent znode. Updating the znode version triggers the handleRegion operation again; however, AssignmentManager finds that the in-memory RegionState has already been deleted, and the transaction goes into a retry loop.

      In SplitTransaction, transitionZKNode retries tickleNodeSplit after sleeping 100 ms. In my opinion, if the sleep were much longer than 100 ms, all the AssignmentManager operations would have time to finish completely.
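
      The retry loop described above can be modelled with a small, self-contained sketch. This is not HBase code: FakeZnode, waitForMaster, deleteAfterSpins and maxSpins are hypothetical names introduced only for illustration, and the 100 ms sleep is omitted so the example runs instantly. It assumes, as the discussion below states, that the tickle call returns -1 once the znode no longer exists, and it shows that the loop only ends when the master side removes the znode, unless some other bound or stop flag is checked.

```java
public class SplitRetrySketch {

    /** Hypothetical stand-in for the unassigned/parent znode. */
    static final class FakeZnode {
        private int version = 0;
        private boolean exists = true;

        /** Bumps the version; returns -1 if the znode is gone or the expected version is stale. */
        synchronized int tickle(int expectedVersion) {
            if (!exists || expectedVersion != version) {
                return -1;
            }
            return ++version;
        }

        /** Models the master deleting the znode after it has processed the split. */
        synchronized void delete() {
            exists = false;
        }
    }

    /**
     * Models the region server waiting on the master.
     * deleteAfterSpins: spin at which the (simulated) master deletes the znode;
     *                   pass -1 to model a master that never cleans up.
     * maxSpins: illustrative safety bound; the loop quoted in this issue has no such bound.
     * Returns the number of spins performed before the loop ended.
     */
    static int waitForMaster(FakeZnode znode, int deleteAfterSpins, int maxSpins) {
        int version = 0;
        int spins = 0;
        do {
            if (spins == deleteAfterSpins) {
                znode.delete();
            }
            // The real loop sleeps 100 ms before each tickle.
            version = znode.tickle(version);
            spins++;
        } while (version != -1 && spins < maxSpins);
        return spins;
    }

    public static void main(String[] args) {
        System.out.println("master cleans up: spins = "
            + waitForMaster(new FakeZnode(), 3, 100));
        System.out.println("master never cleans up: spins = "
            + waitForMaster(new FakeZnode(), -1, 50));
    }
}
```

      In the first call the simulated master deletes the znode and the loop exits on the next tickle; in the second call nothing ever deletes the znode, so only the illustrative maxSpins bound ends the loop, which mirrors the stuck split thread reported here.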

        Activity

        Ted Yu added a comment -

        The following code is from 0.94:

                  Thread.sleep(100);
                  // When this returns -1 it means the znode doesn't exist
                  this.znodeVersion = tickleNodeSplit(server.getZooKeeper(),
                    parent.getRegionInfo(), a.getRegionInfo(), b.getRegionInfo(),
                    server.getServerName(), this.znodeVersion);
                  spins++;
                } while (this.znodeVersion != -1 && !server.isStopped()
                    && !services.isStopping());
        

        In 0.92, the condition for the while loop is (this.znodeVersion != -1).
        Meaning, we check whether the znode exists before writing a new version.

        Lars Hofhansl added a comment -

        Assigning to 0.94.4 for now, since this is not new (I believe)

        ramkrishna.s.vasudevan added a comment -

        @Ted

             this.znodeVersion = transitionNodeSplit(server.getZooKeeper(),
                  parent.getRegionInfo(), a.getRegionInfo(), b.getRegionInfo(),
                  server.getServerName(), this.znodeVersion);
        

        The znodeVersion is updated once the node is changed to SPLIT for the first time, in both versions.

        ramkrishna.s.vasudevan added a comment -

        Maybe this and HBASE-7103 are different.

        Bing Jiang added a comment -

        When the znode changes state from SPLITTING to SPLIT, please ensure that the HRegionServer SplitTransaction waits long enough for the HMaster AssignmentManager to finish its clean-up work. Maybe a ZooKeeperWatcher could be added to SplitTransaction.

        Lars Hofhansl added a comment -

        Pulling back into 0.94.3, so that we at least have a look before 0.94.3 goes out.

        Lars Hofhansl added a comment -

        Any chance for a test for this, Bing?

        Lars Hofhansl added a comment -

        Pushing to 0.94.4 after all.

        Dave Latham added a comment -

        We've run into this a few times. What's the best workaround? So far we've been restarting the master process.

        Dave Latham added a comment -

        Correction, I got my notes wrong. For this one, I restarted the master but that did not solve it. Manually going into ZK and doing an rmr on the /hbase/unassigned regions did seem to solve it. Would that have any other unpleasant side effects?

        ramkrishna.s.vasudevan added a comment -

        As far as I can see, if the split was successful and only the znodes were not cleared, then the cleanup you did should do no harm. Are you using 0.92?

        Dave Latham added a comment -

        Yes, I'm using 0.92. I should note that until I intervened, the region server in question had its split thread stuck waiting for the master to complete, so it stopped processing other splits, which eventually led to some huge regions and some other problems. Relevant regionserver log output:

        2012-12-06 00:00:03,346 DEBUG org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the master to process the split for 374f57cc18a7f8ee54b322350c009169
        2012-12-06 00:00:03,449 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13ae81c303ef0c2-0x13ae81c303ef0c2-0x13ae81c303ef0c2 Attempting to transition node 374f57cc18a7f8ee54b322350c009169 from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
        2012-12-06 00:00:03,451 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13ae81c303ef0c2-0x13ae81c303ef0c2-0x13ae81c303ef0c2 Successfully transitioned node 374f57cc18a7f8ee54b322350c009169 from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
        2012-12-06 00:00:03,553 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13ae81c303ef0c2-0x13ae81c303ef0c2-0x13ae81c303ef0c2 Attempting to transition node 374f57cc18a7f8ee54b322350c009169 from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
        2012-12-06 00:00:03,554 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13ae81c303ef0c2-0x13ae81c303ef0c2-0x13ae81c303ef0c2 Successfully transitioned node 374f57cc18a7f8ee54b322350c009169 from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
        
        Lars Hofhansl added a comment -

        Moving to 0.94.5, since we do not have a patch.

        Lars Hofhansl added a comment -

        This is possibly a dup of HBASE-7551. The issue fixed there would lead to regions being stuck in SPLITTING forever (until a master restart at least), since the AssignmentManager would never learn about it.

        Lars Hofhansl added a comment -

        I think this is a dup.


          People

          • Assignee: Unassigned
          • Reporter: Bing Jiang
          • Votes: 1
          • Watchers: 11
