HBase / HBASE-4083

If Enable table is not completed and is partial, then scanning of the table is not working

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.90.3
    • Fix Version/s: 0.92.0, 0.94.0
    • Component/s: None
    • Labels: None

      Description

      Consider the following scenario:
      1. Start the Master, Backup Master and RegionServer.
      2. Create a table, which in turn creates a region.
      3. Disable the table.
      4. Enable the table again.
      5. Kill the active Master just before the actual region assignment starts.
      6. Restart or switch the Master.
      7. Scan the table.
      NotServingRegionException is thrown.

      1. HBASE-4083_trunk_1.patch
        17 kB
        ramkrishna.s.vasudevan
      2. HBASE-4083_0.90_1.patch
        16 kB
        ramkrishna.s.vasudevan
      3. HBASE-4083_trunk.patch
        15 kB
        ramkrishna.s.vasudevan
      4. HBASE-4083_0.90.patch
        15 kB
        ramkrishna.s.vasudevan
      5. HBASE-4083-1.patch
        13 kB
        ramkrishna.s.vasudevan

        Activity

        ramkrishna.s.vasudevan created issue -
        ramkrishna.s.vasudevan made changes -
        Field Original Value New Value
        Assignee ramkrishna.s.vasudevan [ ram_krish ]
        ramkrishna.s.vasudevan added a comment -

        This issue is similar to HBASE-4052 (improper disabling of a table).
        The following scenarios need to be considered.
        Suppose a table T1 has three regions R1, R2 and R3.
        The table T1 is disabled successfully.
        We now issue an enable command for T1.
        Scenario 1:
        ===========
        All the regions R1, R2 and R3 are onlined and the table state in ZooKeeper is ENABLING when the active Master goes down.
        T1 - ENABLING
        R1 - online
        R2 - online
        R3 - online
        Here the scan works properly, but the table state remains ENABLING. (This is not a problem from the user's perspective when scanning, but the inconsistent system state is a serious concern.)
        Scenario 2:
        ===========
        None of the regions R1, R2 and R3 are onlined, and the active Master goes down while the table state in ZooKeeper is ENABLING.
        T1 - ENABLING
        R1 - offline
        R2 - offline
        R3 - offline
        Here a scan on the table throws NotServingRegionException.
        Scenario 3:
        ===========
        Regions R1 and R2 are onlined but R3 is not yet onlined, and the active Master goes down while the table state in ZooKeeper is ENABLING.
        T1 - ENABLING
        R1 - online
        R2 - online
        R3 - offline

        Here the offline region cannot be scanned and NotServingRegionException is thrown.
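
        The three scenarios can be summarised with a small toy model. This is only an illustrative sketch: PartialEnableModel, RegionState and scanOutcome are hypothetical names invented for this example and are not HBase classes. It assumes a full-table scan that touches every region.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the three partial-enable scenarios; not HBase code. */
final class PartialEnableModel {
    /** Hypothetical per-region state as seen by a region server. */
    enum RegionState { ONLINE, OFFLINE }

    /** Returns what a full-table scan would observe for the given region states. */
    static String scanOutcome(List<RegionState> regions) {
        for (RegionState state : regions) {
            if (state == RegionState.OFFLINE) {
                // Any offline region in the scan range makes the scan fail.
                return "NotServingRegionException";
            }
        }
        return "OK";
    }

    public static void main(String[] args) {
        List<RegionState> scenario1 =
            Arrays.asList(RegionState.ONLINE, RegionState.ONLINE, RegionState.ONLINE);
        List<RegionState> scenario2 =
            Arrays.asList(RegionState.OFFLINE, RegionState.OFFLINE, RegionState.OFFLINE);
        List<RegionState> scenario3 =
            Arrays.asList(RegionState.ONLINE, RegionState.ONLINE, RegionState.OFFLINE);
        System.out.println("Scenario 1: " + scanOutcome(scenario1));
        System.out.println("Scenario 2: " + scanOutcome(scenario2));
        System.out.println("Scenario 3: " + scanOutcome(scenario3));
    }
}
```

        In all three cases the table state stays ENABLING, which is why Scenario 1 is still a concern even though its scan succeeds.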

        ramkrishna.s.vasudevan added a comment -

        The solution can be similar to that of HBASE-4052.
        We have to make the following changes:

        => If the table is in ENABLING state, do not populate its regions in the regions map.
        => Recover by calling EnableProcessHandler. This will try to enable the table by
        onlining the regions.
        Case 1:
        =======
        If the regions were not onlined:
        => The enable handler will online all the regions and set the table state to
        ENABLED.
        => Scanning will work fine.

        Case 2:
        =======
        If the regions were already onlined:
        => There is a check in the OpenRegionHandler to see if the region is already onlined.
        => Move the check to the HRegionServer openRegion() API.
        => Create a new exception class AlreadyOnlinedException and throw it from the RS.
        => In the assign() API in AssignmentManager, catch this exception and add those
        regions to the regions map.
        => This will ensure that the table is moved to ENABLED state.
        Note:
        =====
        Instead of throwing an exception we could add a return type to openRegion() in HRegionInterface. Both approaches involve an interface change.
        Can we proceed like this, or is there a better way?
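
        The proposed recovery flow can be sketched roughly as follows. This is a simplified, self-contained model and not the patch itself: EnableRecoverySketch, FakeRegionServer and recover are hypothetical names, regions are plain strings, and the real AssignmentManager works asynchronously through ZooKeeper rather than in a simple loop. Only AlreadyOnlinedException is taken from the proposal above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified model of recovering a table left in ENABLING; not HBase code. */
final class EnableRecoverySketch {
    enum TableState { DISABLED, ENABLING, ENABLED }

    /** Exception named in the proposal: the region server already serves the region. */
    static final class AlreadyOnlinedException extends Exception {
        AlreadyOnlinedException(String region) {
            super(region + " is already online");
        }
    }

    /** Hypothetical stand-in for a region server that tracks its online regions. */
    static final class FakeRegionServer {
        final String name;
        final Set<String> online = new HashSet<>();

        FakeRegionServer(String name) {
            this.name = name;
        }

        /** The "is it already online?" check lives here, not in the open handler. */
        void openRegion(String region) throws AlreadyOnlinedException {
            if (!online.add(region)) {
                throw new AlreadyOnlinedException(region);
            }
        }
    }

    /** Master-side recovery for a table found in ENABLING after failover. */
    static TableState recover(List<String> tableRegions, FakeRegionServer rs,
                              Map<String, String> regionsMap) {
        for (String region : tableRegions) {
            try {
                // Case 1: the region was offline and is opened now.
                rs.openRegion(region);
            } catch (AlreadyOnlinedException e) {
                // Case 2: the region was opened before the old master died.
            }
            // In both cases the master ends up recording where the region lives.
            regionsMap.put(region, rs.name);
        }
        return TableState.ENABLED;
    }
}
```

        With R1 and R2 already online and R3 offline (Scenario 3), recover() opens only R3, records all three regions, and returns ENABLED.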

        Ted Yu added a comment -

        I don't see an EnableProcessHandler class. Did you mean EnableTableHandler?

        For case 2, since OpenRegionHandler.process() doesn't throw an exception for an online region, I think we should let openRegion() return boolean:

          public boolean openRegion(HRegionInfo region)
        
        ramkrishna.s.vasudevan made changes -
        Attachment HBASE-4083-1.patch [ 12486575 ]
        ramkrishna.s.vasudevan added a comment -

        The attached patch is for review (not the formal patch).
        As there are changes in AssignmentManager.java, I cannot create the actual patch until HBASE-4052 is committed.
        This patch is against the 0.90 branch and includes changes for both HBASE-4052 and HBASE-4083.
        Once HBASE-4052 is committed I will submit the patch for trunk as well, since similar changes need to be made there.

        Ted Yu added a comment -

        Please generate new patch now that HBASE-4052 is resolved.

        +          } catch (KeeperException e) {
        +            master.abort(
        +                "Error deleting OFFLINED node in ZK for transition ZK node ("
        

        Can we deal with the error in a better way?

        ramkrishna.s.vasudevan added a comment -

        @Ted, thanks for the review.
        We need to delete the ZooKeeper node that is in the offline state. As part of recovery, when we try to enable the table, its regions may already be online, so we just try to delete the node created for that region.
        If we don't delete the node, scans can still be performed, but a later attempt to disable the table may not work because the node already exists in ZooKeeper.
        That is why I thought of aborting the master on any failure while deleting the node, which is how we handle it in
        public void offlineDisabledRegion(HRegionInfo regionInfo) and in the
        OpenedRegionHandler.process() API.
        Please provide your comments.

        Ted Yu added a comment -

        Shall we create an enum and use that as return type for:

        +  public boolean openRegion(final HRegionInfo region) throws IOException;
        

        so that future additions to the above method have room to evolve without changing the RPC again.

        When deleting the node, I think we should take care of the following case of KeeperException and not abort server:

            } catch (KeeperException.NoNodeException nne) {
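
        An enum return type along these lines might look like the sketch below. The constant names, the String-based region name, and the InMemoryRegionServer class are assumptions made for illustration; they are not taken from the patch.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative enum-returning openRegion(); names and values are assumptions. */
final class OpenRegionEnumSketch {
    /** New outcomes can be added here later without changing the method signature. */
    enum RegionOpeningState { OPENED, ALREADY_OPENED, FAILED_OPENING }

    /** Hypothetical, simplified counterpart of the region server RPC interface. */
    interface RegionServer {
        RegionOpeningState openRegion(String regionName);
    }

    /** In-memory stand-in that only tracks which regions are online. */
    static final class InMemoryRegionServer implements RegionServer {
        private final Set<String> online = new HashSet<>();

        @Override
        public RegionOpeningState openRegion(String regionName) {
            // Set.add returns false when the region was already present.
            return online.add(regionName)
                ? RegionOpeningState.OPENED
                : RegionOpeningState.ALREADY_OPENED;
        }
    }
}
```

        Compared with a boolean, the enum lets the caller distinguish "opened now" from "already open" and leaves room for further states without another RPC signature change.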
        
        ramkrishna.s.vasudevan added a comment -

        So if the node is not present we have no problem; we can just log it.
        But what if we are not able to delete the node? A subsequent disable may then cause an issue, right?

        Ted Yu added a comment -

        If we are unable to delete the node, we can stop the server.
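
        The error handling agreed on in this exchange can be sketched as below. To keep the example self-contained it does not use the ZooKeeper client library: KeeperLikeException and NoNodeLikeException are hypothetical stand-ins for KeeperException and KeeperException.NoNodeException, and NodeDeleter, Outcome and deleteOfflineNode are likewise invented names.

```java
/** Sketch of "log if the node is missing, abort on any other failure". */
final class DeleteNodeSketch {
    /** Stand-in for a generic ZooKeeper failure. */
    static class KeeperLikeException extends Exception { }

    /** Stand-in for the "node does not exist" failure. */
    static final class NoNodeLikeException extends KeeperLikeException { }

    /** Hypothetical minimal view of the ZooKeeper delete call. */
    interface NodeDeleter {
        void delete(String path) throws KeeperLikeException;
    }

    enum Outcome { DELETED, ALREADY_GONE, ABORT }

    static Outcome deleteOfflineNode(NodeDeleter zk, String path) {
        try {
            zk.delete(path);
            return Outcome.DELETED;
        } catch (NoNodeLikeException e) {
            // Nothing to clean up: a later disable cannot trip over this node.
            System.out.println("Node " + path + " already absent; continuing");
            return Outcome.ALREADY_GONE;
        } catch (KeeperLikeException e) {
            // The node may still exist, so a later disable could fail: stop the server.
            return Outcome.ABORT;
        }
    }
}
```

        The more specific catch clause must come first; otherwise the general one would also swallow the harmless "no node" case.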

        ramkrishna.s.vasudevan made changes -
        Attachment HBASE-4083_0.90.patch [ 12486992 ]
        ramkrishna.s.vasudevan made changes -
        Attachment HBASE-4083_trunk.patch [ 12486993 ]
        ramkrishna.s.vasudevan made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 0.90.3 [ 12316313 ]
        Fix Version/s 0.90.4 [ 12316406 ]
        Fix Version/s 0.92.0 [ 12314223 ]
        ramkrishna.s.vasudevan added a comment -

        I am not able to write a test case covering these scenarios, as it involves killing the master at the moment the table state in ZooKeeper is changed to ENABLING.

        stack added a comment -

        We can't do this for 0.90:

        -  public void openRegion(final HRegionInfo region) throws IOException;
        +  public RegionOpeningState openRegion(final HRegionInfo region) throws IOException;
        

        Doesn't it change the interface? I'm afraid this will break our ability to do a rolling restart between point releases on 0.90 (It might just work but unless you have tried it...)

        I like moving the check of whether the region is online up into the RegionServer and out of the handler.

        This patch looks great otherwise. I like what's going on in here. Thanks for doing the work, Ram.

        stack added a comment -

        TRUNK patch looks good to me. Do you need this fix in 0.90 Ram?

        stack added a comment -

        If you do, then I think you need to prove that a client that doesn't have this patch can talk to a server that does have it (We can't break rolling upgrades)

        ramkrishna.s.vasudevan added a comment -

        @Stack, yes, it is an interface change. I thought about it for a while before doing it.
        I will check the rolling restart behaviour of 0.90 releases with and without the patch.

        ramkrishna.s.vasudevan added a comment -

        @Stack,
        I found one problem with the patch.
        The return type RegionOpeningState has to be added to the map in HbaseObjectWritable; otherwise an IPC error is thrown saying unexpected code.

        Is there anywhere else we need to change? I found a comment in HRegionInterface saying that
        any change made there requires a change to HBaseRPCProtocolVersion.java.
        Do I need to change the version?

        One more thing: I tested the scenario that you mentioned.
        The Master had the patch whereas the RS did not. Whenever a region is opened, a null value is returned on the master side.
        So there is no compatibility problem, but recovery from partial enabling may not work.
        Is that fine, Stack? Do I need to verify anything else?
        I will resubmit the patch shortly.
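
        The mixed-version behaviour described here can be illustrated with a small sketch. It assumes, as reported above, that an unpatched region server's reply surfaces as null on a patched master; MixedVersionSketch, canRecordOnlineImmediately and the enum constants are hypothetical names for this example only.

```java
/** Sketch of a patched master interpreting replies from old and new region servers. */
final class MixedVersionSketch {
    /** Assumed constants; only the type name appears in the discussion. */
    enum RegionOpeningState { OPENED, ALREADY_OPENED }

    /**
     * Returns true only when the reply proves the region is already online,
     * so the master may add it to its regions map straight away.
     */
    static boolean canRecordOnlineImmediately(RegionOpeningState reply) {
        // A null reply (unpatched region server) carries no information, so the
        // master falls back to the old behaviour and the partial-enable recovery
        // for already-online regions does not kick in.
        return reply == RegionOpeningState.ALREADY_OPENED;
    }
}
```

        This matches the observation that the call itself stays compatible while the new recovery path is effective only when both sides carry the patch.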

        stack added a comment -

        The return type RegionOpeningState has to be added to the map in HbaseObjectWritable; otherwise an IPC error is thrown saying unexpected code.

        Good. You tested this distributed. Funny how distributed turns up issues. Thanks for doing this.

        Don't change the ipc version. This will for sure break rolling restarts.

        Sounds like you did sufficient testing. I'll test rolling restart when we cut a 0.90.4. If it's broken, I will back out this patch in 0.90 and cut a new release candidate. I'm willing to do this given the testing you did above, Ram. Thanks.

        ramkrishna.s.vasudevan made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        ramkrishna.s.vasudevan made changes -
        Attachment HBASE-4083_0.90_1.patch [ 12487270 ]
        ramkrishna.s.vasudevan made changes -
        Attachment HBASE-4083_trunk_1.patch [ 12487272 ]
        ramkrishna.s.vasudevan made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        stack made changes -
        Fix Version/s 0.94.0 [ 12316419 ]
        Fix Version/s 0.92.0 [ 12314223 ]
        Fix Version/s 0.90.4 [ 12316406 ]
        stack made changes -
        Fix Version/s 0.90.5 [ 12317145 ]
        Fix Version/s 0.94.0 [ 12316419 ]
        stack added a comment -

        Applied to TRUNK. Thanks Ram.

        (I swapped the order in openRegion so we check whether the region is in transition first and THEN check whether it is open, rather than the other way round.)
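
        The check order described in this comment can be sketched as follows. The method shape, the Result enum and the use of plain sets are assumptions for illustration and do not reproduce the committed HRegionServer code.

```java
import java.util.Set;

/** Sketch of the check order: in-transition first, then already-online. */
final class OpenRegionOrderSketch {
    enum Result { IN_TRANSITION, ALREADY_OPENED, OPEN_SCHEDULED }

    static Result openRegion(String region, Set<String> inTransition, Set<String> online) {
        // 1. A region that is currently being opened or closed is reported first.
        if (inTransition.contains(region)) {
            return Result.IN_TRANSITION;
        }
        // 2. Only then is the online set consulted.
        if (online.contains(region)) {
            return Result.ALREADY_OPENED;
        }
        // 3. Otherwise mark the region as in transition and schedule the open.
        inTransition.add(region);
        return Result.OPEN_SCHEDULED;
    }
}
```

        With this order, a region that appears in both sets is reported as in transition rather than as already open.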

        Hudson added a comment -

        Integrated in HBase-TRUNK #2055 (See https://builds.apache.org/job/HBase-TRUNK/2055/)
        HBASE-4083 If Enable table is not completed and is partial, then scanning of the table is not working

        stack :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java
        • /hbase/trunk/CHANGES.txt
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/HbaseObjectWritable.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/RegionOpeningState.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        stack made changes -
        Fix Version/s 0.90.6 [ 12319200 ]
        Fix Version/s 0.90.5 [ 12317145 ]
        ramkrishna.s.vasudevan added a comment -

        Not fixed in 0.90, hence not resolving the issue. Committed to trunk and 0.92.

        ramkrishna.s.vasudevan made changes -
        Fix Version/s 0.90.7 [ 12319481 ]
        Fix Version/s 0.92.0 [ 12314223 ]
        Fix Version/s 0.90.6 [ 12319200 ]
        Jonathan Hsieh added a comment -

        Removed from 0.90; this was committed to trunk/0.94.0 back in July 2011. Please file a new issue if you want to get it into 0.90.

        Jonathan Hsieh made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.94.0 [ 12316419 ]
        Fix Version/s 0.90.7 [ 12319481 ]
        Resolution Fixed [ 1 ]
        Lars Hofhansl made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee: ramkrishna.s.vasudevan
          • Reporter: ramkrishna.s.vasudevan
          • Votes: 0
          • Watchers: 3