Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-22105

Snapshot of the table fails when region is transitioning

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: snapshots
    • Labels:
      None

      Description

      Region is moving(MoveRegionProcedure)
      Unassign Procedure completed on source node:-

      2019-03-18 06:56:12,027 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=819809c43ee3f276be024a9e0a1bcce5, source=<ip>,22101,1552916141124, destination=<ip>,22101,1552916140986
      2019-03-18 06:56:14,334 INFO org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock for pid=44, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=819809c43ee3f276be024a9e0a1bcce5, source=<ip>,22101,1552916141124, destination=<ip>,22101,1552916140986
      2019-03-18 06:56:14,338 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=51, ppid=44, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=t11, region=819809c43ee3f276be024a9e0a1bcce5, override=true, server=<ip>,22101,1552916141124}]
      2019-03-18 06:56:14,344 INFO org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock for pid=51, ppid=44, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=t11, region=819809c43ee3f276be024a9e0a1bcce5, override=true, server=<ip>,22101,1552916141124
      2019-03-18 06:56:14,349 INFO org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=51 updating hbase:meta row=819809c43ee3f276be024a9e0a1bcce5, regionState=CLOSING, regionLocation=<ip>,22101,1552916141124
      2019-03-18 06:56:14,354 INFO org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Dispatch pid=51, ppid=44, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=t11, region=819809c43ee3f276be024a9e0a1bcce5, override=true, server=<ip>,22101,1552916141124
      2019-03-18 06:56:14,529 INFO org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=51 updating hbase:meta row=819809c43ee3f276be024a9e0a1bcce5, regionState=CLOSED
      

      Assign procedure on destination started:-

      2019-03-18 06:56:14,545 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished subprocedure pid=51, resume processing parent pid=44, state=RUNNABLE:MOVE_REGION_ASSIGN, locked=true; MoveRegionProcedure hri=819809c43ee3f276be024a9e0a1bcce5, source=<ip>,22101,1552916141124, destination=<ip>,22101,1552916140986
      2019-03-18 06:56:14,545 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=53, ppid=44, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=t11, region=819809c43ee3f276be024a9e0a1bcce5, target=<ip>,22101,1552916140986}]
      2019-03-18 06:56:14,545 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=51, ppid=44, state=SUCCESS; UnassignProcedure table=t11, region=819809c43ee3f276be024a9e0a1bcce5, override=true, server=<ip>,22101,1552916141124 in 195msec, unfinishedSiblingCount=1
      2019-03-18 06:56:14,550 INFO org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock for pid=53, ppid=44, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=t11, region=819809c43ee3f276be024a9e0a1bcce5, target=<ip>,22101,1552916140986
      2019-03-18 06:56:14,555 INFO org.apache.hadoop.hbase.master.assignment.AssignProcedure: Starting pid=53, ppid=44, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; AssignProcedure table=t11, region=819809c43ee3f276be024a9e0a1bcce5, target=<ip>,22101,1552916140986; rit=OFFLINE, location=<ip>,22101,1552916140986; forceNewPlan=false, retain=false
      2019-03-18 06:56:14,706 INFO org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=53 updating hbase:meta row=819809c43ee3f276be024a9e0a1bcce5, regionState=OPENING, regionLocation=<ip>,22101,1552916140986
      

      Before Assign completes, snapshot procedure started and failed:-

      2019-03-18 06:56:14,788 INFO org.apache.hadoop.hbase.procedure.Procedure: Starting procedure 'cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11'
      2019-03-18 06:56:14,830 INFO org.apache.hadoop.hbase.procedure.Procedure: Procedure 'cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11' execution completed
      2019-03-18 06:56:14,830 INFO org.apache.hadoop.hbase.procedure.ZKProcedureUtil: Clearing all znodes for procedure cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11including nodes /hbase/online-snapshot/acquired /hbase/online-snapshot/reached /hbase/online-snapshot/abort
      2019-03-18 06:56:14,849 INFO org.apache.hadoop.hbase.master.snapshot.EnabledTableSnapshotHandler: Done waiting - online snapshot for cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11
      2019-03-18 06:56:14,962 ERROR org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier: Snapshot { ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 table=t11 type=FLUSH } looks empty
      2019-03-18 06:56:14,963 ERROR org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Failed taking snapshot { ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 table=t11 type=FLUSH } due to exception:Snapshot { ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 table=t11 type=FLUSH } looks empty
      org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Snapshot { ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 table=t11 type=FLUSH } looks empty
      	at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegions(MasterSnapshotVerifier.java:172)
      	at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifySnapshot(MasterSnapshotVerifier.java:119)
      	at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(TakeSnapshotHandler.java:209)
      	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      2019-03-18 06:56:14,973 INFO org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Stop taking snapshot={ ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 table=t11 type=FLUSH } because: Failed taking snapshot { ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 table=t11 type=FLUSH } due to exception:Snapshot { ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 table=t11 type=FLUSH } looks empty
      
      

      Assign just completes after it:-

      2019-03-18 06:56:14,981 INFO org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=53 updating hbase:meta row=819809c43ee3f276be024a9e0a1bcce5, regionState=OPEN, openSeqNum=5, regionLocation=<ip>,22101,1552916140986
      2019-03-18 06:56:15,094 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=50, state=SUCCESS; org.apache.hadoop.hbase.master.locking.LockProcedure, tableName=t11, type=SHARED in 758msec
      

      Reproducible steps:-

       
      
      hbase(main):003:0> create 't1','f1'
      Created table t1
      
      hbase(main):029:0> get 'hbase:meta','t1,,1553548584756.a53f0cf0e3824955618974942463efd3.','info:state'
      COLUMN CELL
       info:state timestamp=1553548585347, value=OPEN
      
      hbase(main):006:0> unassign 't1,,1553548584756.a53f0cf0e3824955618974942463efd3.'
      
      hbase(main):029:0> get 'hbase:meta','t1,,1553548584756.a53f0cf0e3824955618974942463efd3.','info:state'
      COLUMN CELL
       info:state timestamp=1553548638199, value=CLOSED
      
      hbase(main):031:0> put 'hbase:meta','t1,,1553548584756.a53f0cf0e3824955618974942463efd3.','info:state','OPENING'
      Took 0.0830 seconds
      
      hbase(main):032:0> get 'hbase:meta','t1,,1553548584756.a53f0cf0e3824955618974942463efd3.','info:state'
      COLUMN CELL
       info:state timestamp=1553548863451, value=OPENING
      
      
      hbase(main):033:0> snapshot 't1','t1_a'
      
      ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot \{ ss=t1_a table=t1 type=FLUSH } had an error. t1_a not found in proclist []
       at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:380)
       at org.apache.hadoop.hbase.master.MasterRpcServices.isSnapshotDone(MasterRpcServices.java:1136)
       at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
       at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
       at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
       at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
      Caused by: org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException via Failed taking snapshot \{ ss=t1_a table=t1 type=FLUSH } due to exception:Snapshot \{ ss=t1_a table=t1 type=FLUSH } looks empty:org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Snapshot \{ ss=t1_a table=t1 type=FLUSH } looks empty
       at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:82)
       at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:306)
       at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:369)
       ... 6 more
      Caused by: org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Snapshot \{ ss=t1_a table=t1 type=FLUSH } looks empty
       at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegions(MasterSnapshotVerifier.java:172)
       at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifySnapshot(MasterSnapshotVerifier.java:119)
       at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(TakeSnapshotHandler.java:202)
       at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
      

       

      We have some handling for OFFLINE region, probably we can do the same for other states like CLOSED/OPENING.

        Attachments

          Activity

            People

            • Assignee:
              ankit@apache.org Ankit Singhal
              Reporter:
              ankit@apache.org Ankit Singhal
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated: