Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: master, regionserver
    • Labels:
      None

      Description

      If we restart master when it is disabling one table, the table will be disabling forever.

      In current logic, Region CLOSE RPC will always returned NotServingRegionException because RS has already closed the region before we restart master. So table will be disabling forever because the region will in RIT all along.

      In another case, when AssignmentManager#rebuildUserRegions(), it will put parent regions to AssignmentManager.regions, so we can't close these parent regions until it is purged by CatalogJanitor if we execute disabling the table.

      1. HBASE-5571v3.patch
        4 kB
        chunhui shen
      2. BASE-5571v2.patch
        4 kB
        chunhui shen
      3. HBASE-5571.patch
        8 kB
        chunhui shen

        Activity

        Hide
        stack added a comment -

        This patch has much merit. Thanks for digging in on it.

        I like how you made a method cancelClosingRegionIfDisabling to hold a bunch of code that was inside in a catch block.

        I see why you want to know if a region is splitting – it makes it so we can remove the comments where we speculate a region is splitting – but I don't think keeping a list of outstaning regions in HRS the right place for it ... in particular I don't think we should keep this state in the OnlineRegions Interface (splitting regions are not online). When we go to split a region, we put this fact up into zk. Why not check there rather than trying to keep around a collection of splitting regions?

        Show
        stack added a comment - This patch has much merit. Thanks for digging in on it. I like how you made a method cancelClosingRegionIfDisabling to hold a bunch of code that was inside in a catch block. I see why you want to know if a region is splitting – it makes it so we can remove the comments where we speculate a region is splitting – but I don't think keeping a list of outstaning regions in HRS the right place for it ... in particular I don't think we should keep this state in the OnlineRegions Interface (splitting regions are not online). When we go to split a region, we put this fact up into zk. Why not check there rather than trying to keep around a collection of splitting regions?
        Hide
        stack added a comment -

        Oh, any chance of a test?

        Show
        stack added a comment - Oh, any chance of a test?
        Hide
        chunhui shen added a comment -

        patch v2, check region is splitting through AssignmentManager#RegionState,
        Also I fix another a bug when region is disabling and split is rolled back, in current logic, if parent region is roll back and it will not be closed if the table is disabling.

        Show
        chunhui shen added a comment - patch v2, check region is splitting through AssignmentManager#RegionState, Also I fix another a bug when region is disabling and split is rolled back, in current logic, if parent region is roll back and it will not be closed if the table is disabling.
        Hide
        ramkrishna.s.vasudevan added a comment -

        Patch v2 does not include changes in v1? is it intended?

        Show
        ramkrishna.s.vasudevan added a comment - Patch v2 does not include changes in v1? is it intended?
        Hide
        chunhui shen added a comment -

        As stack's comment, patch v2 use zk's split info to check whether region is splitting, needn't HRegionServer keeping splitting-region list.

        Show
        chunhui shen added a comment - As stack's comment, patch v2 use zk's split info to check whether region is splitting, needn't HRegionServer keeping splitting-region list.
        Hide
        ramkrishna.s.vasudevan added a comment -

        Sorry Chunhui.. Did not see Stack's comments fully. Will review and get back to this.
        Thanks

        Show
        ramkrishna.s.vasudevan added a comment - Sorry Chunhui.. Did not see Stack's comments fully. Will review and get back to this. Thanks
        Hide
        ramkrishna.s.vasudevan added a comment -

        Handling disabling when nodeDeleted is needed as we try doing in SplitRegionHandler.

        catch (NotServingRegionException nsre) {
               LOG.info("Server " + server + " returned " + nsre + " for " +
                 region.getRegionNameAsString());
        -      // Presume that master has stale data.  Presume remote side just split.
        -      // Presume that the split message when it comes in will fix up the master's
        -      // in memory cluster state.
        +      RegionState regionState = regionsInTransition.get(encodedName);
        +      if (regionState != null
        +          && (regionState.isSplitting() || regionState.isSplit())) {
        +        LOG.info("Region " + region.getRegionNameAsString()
        +            + " is splitting, return");
        +        return;
        +      } else {
        +        cancelClosingRegionIfDisabling(region);
        +      }
        

        Is this the correct place to add the code. I feel NSRE will be a remote exception and this check should come in the below catch(Throwable) where we check for NSRE.
        I remember happening lik that when i was working on some earlier issues. Correct me if am wrong. BTW, thanks for the patch.

        Show
        ramkrishna.s.vasudevan added a comment - Handling disabling when nodeDeleted is needed as we try doing in SplitRegionHandler. catch (NotServingRegionException nsre) { LOG.info( "Server " + server + " returned " + nsre + " for " + region.getRegionNameAsString()); - // Presume that master has stale data. Presume remote side just split. - // Presume that the split message when it comes in will fix up the master's - // in memory cluster state. + RegionState regionState = regionsInTransition.get(encodedName); + if (regionState != null + && (regionState.isSplitting() || regionState.isSplit())) { + LOG.info( "Region " + region.getRegionNameAsString() + + " is splitting, return " ); + return ; + } else { + cancelClosingRegionIfDisabling(region); + } Is this the correct place to add the code. I feel NSRE will be a remote exception and this check should come in the below catch(Throwable) where we check for NSRE. I remember happening lik that when i was working on some earlier issues. Correct me if am wrong. BTW, thanks for the patch.
        Hide
        chunhui shen added a comment -

        In our testing with 90 version, close region rpc return NSRE rather than a remote exception if region is already been closed, so that table will be diabling all the time.

        Maybe calling unassign(rs.getRegion()) in the nodeDeleted() will affect zk event.
        What about define a new state for RegionState.state which will retry closing if return NSRE.

        Show
        chunhui shen added a comment - In our testing with 90 version, close region rpc return NSRE rather than a remote exception if region is already been closed, so that table will be diabling all the time. Maybe calling unassign(rs.getRegion()) in the nodeDeleted() will affect zk event. What about define a new state for RegionState.state which will retry closing if return NSRE.
        Hide
        chunhui shen added a comment -

        patchv3 change unassign() in nodeDeleted() to invokeUnassign().

        Show
        chunhui shen added a comment - patchv3 change unassign() in nodeDeleted() to invokeUnassign().
        Hide
        ramkrishna.s.vasudevan added a comment -

        @Chunhui
        +1 on making invokeUnassign(). But NSRE is my doubt. I will try out once if am able to reproduce it.

        Show
        ramkrishna.s.vasudevan added a comment - @Chunhui +1 on making invokeUnassign(). But NSRE is my doubt. I will try out once if am able to reproduce it.
        Hide
        Ted Yu added a comment -
        +        LOG.info("Rs= " + regionState + ", return");
        

        "Rs" may mean region server. "region " should be enough. I think we can also mention that NotServingRegionException is skipped.

        Show
        Ted Yu added a comment - + LOG.info( "Rs= " + regionState + ", return " ); "Rs" may mean region server. "region " should be enough. I think we can also mention that NotServingRegionException is skipped.
        Hide
        ramkrishna.s.vasudevan added a comment -

        @Chunhui
        I did not reproduce it as i got lost in some other work. But if the remote exception is not thrown why all the logic related to NSRE is present there.
        May be am wrong. But i cannot tell surely also as i did not reproduce it now.

        Show
        ramkrishna.s.vasudevan added a comment - @Chunhui I did not reproduce it as i got lost in some other work. But if the remote exception is not thrown why all the logic related to NSRE is present there. May be am wrong. But i cannot tell surely also as i did not reproduce it now.
        Hide
        chunhui shen added a comment -

        If AssignmentManager#unassign() catchs RemoteException rather than NotServingRegionException, could we remove catching NotServingRegionException?

        I'm not sure when catches RemoteException and when catches NotServingRegionException directly, from our logs I see it catches NotServingRegionException directly. Anyway, for this issue, we could treat samely for the two exceptions.

        Show
        chunhui shen added a comment - If AssignmentManager#unassign() catchs RemoteException rather than NotServingRegionException, could we remove catching NotServingRegionException? I'm not sure when catches RemoteException and when catches NotServingRegionException directly, from our logs I see it catches NotServingRegionException directly. Anyway, for this issue, we could treat samely for the two exceptions.

          People

          • Assignee:
            chunhui shen
            Reporter:
            chunhui shen
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:

              Development