HBase / HBASE-10136

the table-lock of TableEventHandler is released too early because reOpenAllRegions() is asynchronous

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.98.0, 0.96.0, 0.99.0
    • Fix Version/s: None
    • Component/s: master

      Description

      Expected behavior:
      With the introduction of the table-lock, a user can issue a request for a snapshot of a table while that table is undergoing an online schema change and expect that snapshot request to complete correctly. Also, the same is true if a user issues an online schema change request while a snapshot attempt is ongoing.

      Observed behavior:
      Snapshot attempts time out when there is an ongoing online schema change: nothing else holds the table lock, so the snapshot proceeds while regions are being closed and reopened underneath it.
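The race described above can be sketched with plain JDK primitives. This is a toy model with illustrative names (`TableLockRace`, `snapshotSawReopenDone`), not HBase code: the "alter" handler releases the lock while its async reopen is still pending, so a "snapshot" that then grabs the lock observes regions mid-transition.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

// Toy model of the race (illustrative names, not HBase APIs).
// The "alter" handler updates the schema, submits an async region
// reopen, and releases the table lock WITHOUT waiting for it; a
// "snapshot" that then acquires the lock still sees the reopen pending.
public class TableLockRace {
    /** Returns whether the snapshot saw the reopen finished (it won't). */
    static boolean snapshotSawReopenDone() throws InterruptedException {
        ReentrantLock tableLock = new ReentrantLock();
        CountDownLatch reopenDone = new CountDownLatch(1);
        CountDownLatch mayReopen = new CountDownLatch(1);
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Alter-table handler: lock, "update schema", kick off async reopen.
        tableLock.lock();
        pool.submit(() -> {
            try {
                mayReopen.await();        // reopen still in flight...
            } catch (InterruptedException ignored) {
            }
            reopenDone.countDown();       // ...it finishes only later.
        });
        tableLock.unlock();               // Released too early: the bug.

        // Snapshot: acquires the lock immediately, but the lock no
        // longer guards anything -- the reopen is still pending.
        tableLock.lock();
        boolean finished = reopenDone.getCount() == 0;
        tableLock.unlock();

        mayReopen.countDown();            // Let the reopen complete.
        pool.shutdown();
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        // Prints "snapshot saw reopen done: false"
        System.out.println("snapshot saw reopen done: " + snapshotSawReopenDone());
    }
}
```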

      TableEventHandler trace

      // 1. client.addColumn() call from client...
      
      // 2. The operation is now on the master
      2013-12-12 12:09:57,613 DEBUG [MASTER] lock.ZKInterProcessLockBase: Acquired a lock for /hbase/table-lock/TestTable/write-master:452010000000001
      2013-12-12 12:09:57,640 INFO  [MASTER] handler.TableEventHandler: Handling table operation C_M_ADD_FAMILY on table TestTable
      2013-12-12 12:09:57,685 INFO  [MASTER] master.MasterFileSystem: AddColumn. Table = TestTable HCD = {NAME => 'x-1386850197327', DATA_BLOCK_ENCODING => 'NONE',$
      2013-12-12 12:09:57,693 INFO  [MASTER] handler.TableEventHandler: Bucketing regions by region server...
      ...
      2013-12-12 12:09:57,771 INFO  [MASTER] handler.TableEventHandler: Completed table operation C_M_ADD_FAMILY on table TestTable
      2013-12-12 12:09:57,771 DEBUG [MASTER] master.AssignmentManager: Starting unassign of TestTable,,1386849056038.854b280$
      2013-12-12 12:09:57,772 DEBUG [MASTER] lock.ZKInterProcessLockBase: Released /hbase/table-lock/TestTable/write-master:452010000000001
      
      // 3. The Table*Handler operation is now completed, and the client notified with "I'm done!"
      
      // 4. Now the BulkReopen is starting doing the reopen
      2013-12-12 12:09:57,772 INFO  [MASTER] master.RegionStates: Transitioned {854b280006aec464083778a5cb5f5456 state=OPEN,$
      
      1. HBASE-10136-trunk.patch
        6 kB
        Aleksandr Shulman
      2. HBASE-10136-v0.patch
        4 kB
        Matteo Bertozzi

        Issue Links

          Activity

          Jimmy Xiang added a comment -

          So you just reuse the waitUntilDone. Looks fine to me. I was wondering why we have to re-assign regions one by one.

          Jimmy Xiang added a comment -

          HBASE-10137 changed the table enabler. Does that fix work for this case too?

          Aleksandr Shulman added a comment -

          Adding a patch for a test that exposes this issue. Test should pass once this issue is resolved.

          Matteo Bertozzi added a comment -

          Sergey Shelukhin we're not talking about snapshots here. Currently snapshots are built to fail if a region is moving or is down, and this is by design. If you want to talk about how to fix that, open another jira.

          The problem here is the TableEventHandler and when the table lock is released:
          for example, if you call modifyTable() twice, or a split runs concurrently with modifyTable(), you don't get the behavior we want from the table lock, which should be that an operation on the table stays locked until the other is completed.

          Also, the other problem, not completely related, that I'm pointing out is that because of this async completion the client call is not synchronous.

          Sergey Shelukhin added a comment -

          If a server fails the region will not stay open... I don't think it's a good idea to rely on that. Locking would work as a temporary fix I guess, for this particular interaction. But why cannot snapshot handle the general case of regions becoming unavailable? It's not like close-open takes time like recovery does during alter table.

          Jonathan Hsieh added a comment -

          I agree with needing rules – the invariant I think we need here is that if an operation starts with a region in the open state and is supposed to complete with the region in the open state, it must be open (or a suitable replacement must be open).

          Currently I only see open/close/open conflicts (splits/alters, likely merges). Can we get away with just "fixing" those three operations so that their respective table locks are held until the opens complete? Is the wait until handler completion needed for any other operations?

          Matteo Bertozzi added a comment -

          As I pointed out in my comment above, we can fix this case by case, but the main problems are still there.
          If you implement a new handler you have to keep these rules in mind to make everything work.
          In this case, for example, the end of the handler is not the end of the operation, so the lock is released early.
          Also in this case, the master call uses handler.process() instead of the executor pool to make the client operation synchronous.
          In the delete table case the last operation must be the removal of the table descriptor, otherwise the client call will not be synchronous.
          ...and so on with other implementation details.

          I've pointed at the new master design, to discuss this set of "rules" and make them part of the design.
          We must be able to know when an operation ends, not just guess based on the result state of an operation. And this is a must for both the server side (e.g. releasing the lock) and the client side (e.g. sync operations).

          Sergey Shelukhin added a comment -

          Agree with Jon, it looks like an implementation detail to me.

          Jonathan Hsieh added a comment -

          The most clear fix to this is to fix the master itself I think (HBASE-5487).

          While I think this kind of "race" is something that new HBASE-5487 designs should handle, I disagree that that is the clearest way. I do think open/close/open cases can be handled within the current framework.

          Enis Soztutar added a comment -

          Matteo you are right in the analysis. The table lock is released before the regions are complete with opening, because of how region reopening and the master handlers are implemented. The most clear fix to this is to fix the master itself I think (HBASE-5487).

          Matteo Bertozzi added a comment -

          This is a more generic problem than alter table and snapshots.
          Both snapshots and alter table seem to acquire/release the table-lock properly,
          but the BulkReOpen operation called by TableEventHandler.reOpenAllRegions() is asynchronous:

          [hbase/master/BulkReOpen.java]
          /** Reopen the regions asynchronously, so always returns true immediately. */
          protected boolean waitUntilDone(long timeout) {
            return true;
          }
          

          This means that each table operation using TableEventHandler.reOpenAllRegions(), in this case the alter table operation, releases the lock before the operation is completed.

          If you just call admin.addColumn() and take a look at the logs, you can easily see that the operation is done in an async way.

          // 1. client.addColumn() call from client...
          
          // 2. The operation is now on the master
          2013-12-12 12:09:57,613 DEBUG [MASTER] lock.ZKInterProcessLockBase: Acquired a lock for /hbase/table-lock/TestTable/write-master:452010000000001
          2013-12-12 12:09:57,640 INFO  [MASTER] handler.TableEventHandler: Handling table operation C_M_ADD_FAMILY on table TestTable
          2013-12-12 12:09:57,685 INFO  [MASTER] master.MasterFileSystem: AddColumn. Table = TestTable HCD = {NAME => 'x-1386850197327', DATA_BLOCK_ENCODING => 'NONE',$
          2013-12-12 12:09:57,693 INFO  [MASTER] handler.TableEventHandler: Bucketing regions by region server...
          ...
          2013-12-12 12:09:57,771 INFO  [MASTER] handler.TableEventHandler: Completed table operation C_M_ADD_FAMILY on table TestTable
          2013-12-12 12:09:57,771 DEBUG [MASTER] master.AssignmentManager: Starting unassign of TestTable,,1386849056038.854b280$
          2013-12-12 12:09:57,772 DEBUG [MASTER] lock.ZKInterProcessLockBase: Released /hbase/table-lock/TestTable/write-master:452010000000001
          
          // 3. The Table*Handler operation is now completed, and the client notified with "I'm done!"
          
          // 4. Now the BulkReopen is starting doing the reopen
          2013-12-12 12:09:57,772 INFO  [MASTER] master.RegionStates: Transitioned {854b280006aec464083778a5cb5f5456 state=OPEN,$
          2013-12-12 12:09:57,772 INFO  [Priority.RpcServer.handler=5,port=39384] regionserver.HRegionServer: Close 854b280006aec464083778a5cb5f5456, via zk=yes, znode version=0, on null
          2013-12-12 12:09:57,773 DEBUG [RS_CLOSE_REGION] handler.CloseRegionHandler: Processing close of TestTable,,1386849056038.854b280006aec464083778a5cb5f5456.
          2013-12-12 12:09:57,773 DEBUG [MASTER] master.AssignmentManager: Sent CLOSE to localhost,39384,1386848453374 for regio$
          2013-12-12 12:09:57,773 DEBUG [RS_CLOSE_REGION] regionserver.HRegion: Closing TestTable,,1386849056038.854b280006aec464083778a5cb5f5456.: disabling compactions & flush$
          2013-12-12 12:09:57,773 DEBUG [RS_CLOSE_REGION] regionserver.HRegion: Updates disabled for region TestTable,,1386849056038.854b280006aec464083778a5cb5f5456.
          2013-12-12 12:09:57,774 INFO  [StoreCloserThread-TestTable,,1386849056038.854b280006aec464083778a5cb5f5456.-1] regionserver.HStore: Closed info
          2013-12-12 12:09:57,774 INFO  [RS_CLOSE_REGION] regionserver.HRegion: Closed TestTable,,1386849056038.854b280006aec464083778a5cb5f5456.
          2013-12-12 12:09:57,774 DEBUG [RS_CLOSE_REGION] zookeeper.ZKAssign: regionserver:39384-0x142e69bfea20001, quorum=localhost:2181, baseZNode=/hbase Transitioning 854b280$
          2013-12-12 12:09:57,775 DEBUG [RS_CLOSE_REGION] zookeeper.ZKAssign: regionserver:39384-0x142e69bfea20001, quorum=localhost:2181, baseZNode=/hbase Transitioned node 854$
          2013-12-12 12:09:57,775 DEBUG [RS_CLOSE_REGION] handler.CloseRegionHandler: Set closed state in zk for TestTable,,1386849056038.854b280006aec464083778a5cb5f5456. on lo$
          2013-12-12 12:09:57,775 DEBUG [RS_CLOSE_REGION] handler.CloseRegionHandler: Closed TestTable,,1386849056038.854b280006aec464083778a5cb5f5456.
          2013-12-12 12:09:57,775 INFO  [ProcessThread(sid:0 cport:-1):] server.PrepRequestProcessor: Processed session termination for sessionid: 0x142e69bfea20023
          2013-12-12 12:09:57,775 DEBUG [AM.ZK.Worker-pool2-t70] master.AssignmentManager: Handling RS_ZK_REGION_CLOSED, server=localhost,39384,1386848453374, region=854b280006aec464083778a5cb5f5$
          

          So, this is another case where the Admin operation is not synchronous, and the table-lock is not really able to do its job, since the end of handler.process() doesn't match "operation completed".

          We can probably do a dirty fix for each case as they come up (see also HBASE-6992),
          but this is a general problem with the current master/client design: we don't really know when an operation is completed. Looking at the new master design (HBASE-5487), it looks like this will be resolved, but the new master seems to be in the distant future...
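A minimal sketch of the direction this comment implies (not the actual HBASE-10136 patch, and `BlockingBulkReopen`/`regionReopened` are illustrative names): make waitUntilDone actually block until every reopen has completed, so the handler, and therefore the lock release and the client notification, only finishes when the operation does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative sketch, not the real BulkReOpen: waitUntilDone blocks
// until every region reports back (or the timeout expires) instead of
// returning true immediately, so the caller can hold the table lock
// for the full duration of the reopen.
public class BlockingBulkReopen {
    private final CountDownLatch pending;

    public BlockingBulkReopen(int regionCount) {
        this.pending = new CountDownLatch(regionCount);
    }

    /** Called once per region when its close/open cycle finishes. */
    public void regionReopened() {
        pending.countDown();
    }

    /** True if all reopens completed within the timeout, false otherwise. */
    public boolean waitUntilDone(long timeoutMs) throws InterruptedException {
        return pending.await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```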

          Aleksandr Shulman added a comment -

          Sorry, yes, I should have phrased it as:


          A user can issue a request for a snapshot of a table while that table is undergoing an online schema change and expect that snapshot request to complete correctly. Also, the same is true if a user issues an online schema change request while a snapshot attempt is ongoing.

          Andrew Purtell added a comment -

          Expected behavior:
          A user can take a snapshot of a table while that table is undergoing an online schema change.

          It would be safer for one to hold for the other, and vice versa.

          Aleksandr Shulman added a comment -

          A potential solution might be table locking:
          with the table lock we would expect the modifyTable to wait for the snapshot to complete, or the snapshot to wait for the modifyTable to complete.
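The mutual exclusion suggested here can be sketched with a plain JDK lock (a toy model: `TableLockSketch` and `trySnapshot` are illustrative names, not HBase's ZooKeeper-based table lock). If the schema change holds the table lock for its whole duration, including the region reopen, a concurrent snapshot either waits or times out cleanly instead of racing with moving regions.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Toy sketch of the intended semantics (not HBase's ZK-based lock):
// modifyTable and snapshot serialize on the same per-table lock, so
// a snapshot against a busy table waits or fails fast, rather than
// running while regions are closed and reopened underneath it.
public class TableLockSketch {
    final ReentrantLock tableLock = new ReentrantLock();

    /** Snapshot waits up to timeoutMs for the lock; false = timed out. */
    boolean trySnapshot(long timeoutMs) throws InterruptedException {
        if (!tableLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false;                  // Table busy: give up cleanly.
        }
        try {
            return true;                   // Safe: no schema change running.
        } finally {
            tableLock.unlock();
        }
    }
}
```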


            People

            • Assignee:
              Matteo Bertozzi
              Reporter:
              Aleksandr Shulman
            • Votes: 0
            • Watchers: 9
