Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-20552

HBase RegionServer was shutdown due to UnexpectedStateException

    XMLWordPrintableJSON

Details

    • Bug
    • Status: In Progress
    • Critical
    • Resolution: Unresolved
    • 2.0.0
    • None
    • None
    • None

    Description

      This was observed during cluster testing (source code sync'ed with hbase-2.0, built May 2nd):

      2018-05-02 05:44:10,089 ERROR [RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=20000] master.MasterRpcServices: Region server ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 reported a fatal error:
      ***** ABORTING region server ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474: org.apache.hadoop.hbase.YouAreDeadException: rit=OPEN, location=ctr-e138-             1518143905142-279227-01-000007.hwx.site,16020,1525239609353, table=test_hbase_ha_load_test_tool_hbase, region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on server=ctr-e138-          1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but state has otherwise.
        at org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1065)
        at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:987)
        at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:459)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13118)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
      Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: rit=OPEN, location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,                         table=test_hbase_ha_load_test_tool_hbase, region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but state  has otherwise.
        at org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1037)
        ... 7 more
       *****
      Cause:
      org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: rit=OPEN, location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,       table=test_hbase_ha_load_test_tool_hbase, region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but state  has otherwise.
        at org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1065)
        at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:987)
        at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:459)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13118)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
      Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: rit=OPEN, location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,                         table=test_hbase_ha_load_test_tool_hbase, region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but state  has otherwise.
        at org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1037)
        ... 7 more
      
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      

      elserj and I (Ted) did some initial analysis.

      In the following description, M1 refers to master-ctr-e138-1518143905142-279227-01-000005 and M2 refers to master-ctr-e138-1518143905142-279227-01-000003.

      Let's follow region 94f6ca283dbb4445b2bcdc321b734d28 .

      Master 1 was moving the region to 000007:

      2018-05-02 05:38:59,017 INFO  [master/ctr-e138-1518143905142-279227-01-000005:20000.Chore.1] master.HMaster: balance hri=94f6ca283dbb4445b2bcdc321b734d28, source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
      2018-05-02 05:38:59,126 DEBUG [master/ctr-e138-1518143905142-279227-01-000005:20000.Chore.1] procedure2.ProcedureExecutor: Stored pid=465, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=94f6ca283dbb4445b2bcdc321b734d28, source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
      

      Region was closed on 000002:

      2018-05-02 05:38:59,327 INFO  [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1] regionserver.HRegion: Closed test_hbase_ha_load_test_tool_hbase,1eb851e9,1525238855422.94f6ca283dbb4445b2bcdc321b734d28.
      2018-05-02 05:38:59,327 INFO  [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1] regionserver.HRegionServer: Adding 94f6ca283dbb4445b2bcdc321b734d28 move to ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502 record at close sequenceid=13395
      2018-05-02 05:38:59,329 DEBUG [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1] handler.CloseRegionHandler: Closed test_hbase_ha_load_test_tool_hbase,1eb851e9,1525238855422.94f6ca283dbb4445b2bcdc321b734d28.
      

      The master saw the destination crash:

      2018-05-02 05:39:44,792 INFO  [main-EventThread] master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502]
      2018-05-02 05:39:44,793 INFO  [main-EventThread] master.ServerManager: Processing expiration of ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502 on ctr-e138-1518143905142-279227-01-000005.hwx.site,20000,1525238791399
      

      Processed it:

      2018-05-02 05:39:45,616 DEBUG [PEWorker-16] procedure.ServerCrashProcedure: Done splitting WALs pid=471, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, splitWal=true, meta=true
      2018-05-02 05:39:45,620 INFO  [PEWorker-16] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=474, ppid=471, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_hbase, region=670f6b815d2acac905130e5440d59304}, {pid=475, ppid=471, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_hbase, region=94f6ca283dbb4445b2bcdc321b734d28}, {pid=476, ppid=471, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_hbase, region=1595f38ee901be7c67b997fe2fc95951}, {pid=477, ppid=471, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_hbase, region=a0620fc83de532a37f6a9bb8f99cc6c4}, {pid=478, ppid=471, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_hbase, region=e75a388bc2011feed75bdc1a0e99a9a9}, {pid=479, ppid=471, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_namenode, region=c963eb77dbdc6dbab886dbe4eebba5ad}, {pid=480, ppid=471, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_namenode, region=f3db9f9879ed03f488dcb89bea834237}, {pid=481, ppid=471, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_namenode, region=b5180eee96b616afdf79578309c66a11}, {pid=482, ppid=471, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_namenode, region=a6e0d7561c4f19e78f94d37462588281}, {pid=483, ppid=471, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_namenode, region=1d954f21d711345a9587d995cecea136}, {pid=484, ppid=471, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_namenode, region=c078deb2474e9c19b85b5fdb9efaa47d}, {pid=485, ppid=471, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_namenode, region=91f73e76bbe7bc8a61b1b1299d34c6ab}, {pid=486, ppid=471, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=ATLAS_ENTITY_AUDIT_EVENTS, region=8dc6fd2022c2fdf8c065fbd16cadaaca}]
      

      Master 1 tried to assign the region back on the original RS (that we tried to move it off of):

      2018-05-02 05:39:46,136 INFO  [PEWorker-1] assignment.RegionTransitionProcedure: Dispatch pid=475, ppid=471, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure                  table=test_hbase_ha_load_test_tool_hbase, region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, location=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474
      

      Then Master 1 went down, Master 2 picked up the region.

      M2 ran a MoveProcedure where it thought region was on 007:

      2018-05-02 05:43:33,876 DEBUG [master/ctr-e138-1518143905142-279227-01-000003:20000] procedure2.ProcedureExecutor: Completed pid=465, state=SUCCESS; MoveRegionProcedure hri=94f6ca283dbb4445b2bcdc321b734d28, source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
      

      and assigned region to 007:

      2018-05-02 05:44:08,464 INFO  [PEWorker-6] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_hbase, region=94f6ca283dbb4445b2bcdc321b734d28}]
      2018-05-02 05:44:08,472 INFO  [PEWorker-6] procedure.MasterProcedureScheduler: pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_hbase, region=94f6ca283dbb4445b2bcdc321b734d28 checking lock on 94f6ca283dbb4445b2bcdc321b734d28
      2018-05-02 05:44:08,480 INFO  [PEWorker-6] assignment.AssignProcedure: Starting pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test_hbase_ha_load_test_tool_hbase, region=94f6ca283dbb4445b2bcdc321b734d28; rit=OFFLINE, location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502; forceNewPlan=false, retain=true
      2018-05-02 05:44:08,646 INFO  [master/ctr-e138-1518143905142-279227-01-000003:20000] balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart assignment. 
      2018-05-02 05:44:08,659 INFO  [PEWorker-11] assignment.RegionStateStore: pid=508 updating hbase:meta row=94f6ca283dbb4445b2bcdc321b734d28, regionState=OPENING, regionLocation=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
      2018-05-02 05:44:08,727 INFO  [PEWorker-11] assignment.RegionTransitionProcedure: Dispatch pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=test_hbase_ha_load_test_tool_hbase, region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
      2018-05-02 05:44:08,768 INFO  [master/ctr-e138-1518143905142-279227-01-000003:20000] master.HMaster: Master has completed initialization 41.430sec
      2018-05-02 05:44:09,213 DEBUG [RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=20000] assignment.RegionTransitionProcedure: Received report OPENED seqId=13402, pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=test_hbase_ha_load_test_tool_hbase, region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
      2018-05-02 05:44:09,213 DEBUG [PEWorker-12] assignment.RegionTransitionProcedure: Finishing pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_FINISH; AssignProcedure table=test_hbase_ha_load_test_tool_hbase, region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
      2018-05-02 05:44:09,214 INFO  [PEWorker-12] assignment.RegionStateStore: pid=508 updating hbase:meta row=94f6ca283dbb4445b2bcdc321b734d28, regionState=OPEN, openSeqNum=13402, regionLocation=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
      2018-05-02 05:44:09,258 INFO  [PEWorker-12] procedure2.ProcedureExecutor: Finished subprocedure(s) of pid=507, state=RUNNABLE:SERVER_CRASH_HANDLE_RIT2; ServerCrashProcedure server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, splitWal=true, meta=false; resume parent processing.
      2018-05-02 05:44:09,258 INFO  [PEWorker-12] procedure2.ProcedureExecutor: Finished pid=508, ppid=507, state=SUCCESS; AssignProcedure table=test_hbase_ha_load_test_tool_hbase, region=94f6ca283dbb4445b2bcdc321b734d28 in 764msec
      2018-05-02 05:44:09,273 INFO  [PEWorker-14] procedure2.ProcedureExecutor: Finished pid=507, state=SUCCESS; ServerCrashProcedure server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, splitWal=true, meta=false in 975msec
      

      So when 000002 reported region to master 2, we got the exception.

      Attachments

        Activity

          People

            uagashe Umesh Agashe
            romil.choksi Romil Choksi
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

              Created:
              Updated: