  1. HBase
  2. HBASE-20828 Finish-up AMv2 Design/List of Tenets/Specification of operation
  3. HBASE-20864

RS was killed because master thought the region should be on an already dead server

Details

    • Sub-task
    • Status: Closed
    • Major
    • Resolution: Resolved
    • 2.0.0

    Description

      While running ITBLL on our internal 2.0.0 version (with 2.0.1 backported, plus two other issues: HBASE-20706 and HBASE-20752), I found that two of my RSes had been killed by master because master's region state differed from what those RSes reported. Strangely, master thought these regions should be on an already dead server. There might be a serious bug, but I haven't found it yet. Here is the sequence of events:

      1. e010125048153.bja,60020,1531137365840 crashed, and 4423e4182457c5b573729be4682cc3a3 was clearly assigned to e010125049164.bja,60020,1531136465378 during ServerCrashProcedure

      2018-07-09 20:03:32,443 INFO  [PEWorker-10] procedure.ServerCrashProcedure: Start pid=2303, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=e010125048153.bja,60020,1531137365840, splitWal=true, meta=false
      2018-07-09 20:03:39,220 DEBUG [RpcServer.default.FPBQ.Fifo.handler=294,queue=24,port=60000] assignment.RegionTransitionProcedure: Received report OPENED seqId=16021, pid=2305, ppid=2303, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=IntegrationTestBigLinkedList, region=4423e4182457c5b573729be4682cc3a3; rit=OPENING, location=e010125049164.bja,60020,1531136465378
      2018-07-09 20:03:39,220 INFO  [PEWorker-13] assignment.RegionStateStore: pid=2305 updating hbase:meta row=4423e4182457c5b573729be4682cc3a3, regionState=OPEN, openSeqNum=16021, regionLocation=e010125049164.bja,60020,1531136465378
      2018-07-09 20:03:43,190 INFO  [PEWorker-12] procedure2.ProcedureExecutor: Finished pid=2303, state=SUCCESS; ServerCrashProcedure server=e010125048153.bja,60020,1531137365840, splitWal=true, meta=false in 10.7490sec
      

      2. A modify table happened later, and 4423e4182457c5b573729be4682cc3a3 was reopened on e010125049164.bja,60020,1531136465378

      2018-07-09 20:04:39,929 DEBUG [RpcServer.default.FPBQ.Fifo.handler=295,queue=25,port=60000] assignment.RegionTransitionProcedure: Received report OPENED seqId=16024, pid=2351, ppid=2314, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=IntegrationTestBigLinkedList, region=4423e4182457c5b573729be4682cc3a3, target=e010125049164.bja,60020,1531136465378; rit=OPENING, location=e010125049164.bja,60020,1531136465378
      2018-07-09 20:04:40,554 INFO  [PEWorker-6] assignment.RegionStateStore: pid=2351 updating hbase:meta row=4423e4182457c5b573729be4682cc3a3, regionState=OPEN, openSeqNum=16024, regionLocation=e010125049164.bja,60020,1531136465378
      

      3. The active master was killed and the backup master took over, but when it loaded the meta entry, it clearly showed 4423e4182457c5b573729be4682cc3a3 on the previously dead server e010125048153.bja,60020,1531137365840. That is very strange!

      2018-07-09 20:06:17,985 INFO  [master/e010125048016:60000] assignment.RegionStateStore: Load hbase:meta entry region=4423e4182457c5b573729be4682cc3a3, regionState=OPEN, lastHost=e010125049164.bja,60020,1531136465378, regionLocation=e010125048153.bja,60020,1531137365840, openSeqNum=16024
      

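      4. The RS e010125049164.bja,60020,1531136465378 was killed, because it reported the region OPEN while master's state placed the region on the dead server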

      2018-07-09 20:06:20,265 WARN  [RpcServer.default.FPBQ.Fifo.handler=297,queue=27,port=60000] assignment.AssignmentManager: Killing e010125049164.bja,60020,1531136465378: rit=OPEN, location=e010125048153.bja,60020,1531137365840, table=IntegrationTestBigLinkedList, region=4423e4182457c5b573729be4682cc3a3reported OPEN on server=e010125049164.bja,60020,1531136465378 but state has otherwise.
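
      To look for other regions in the same situation, the "Load hbase:meta entry" lines in the new master's log can be scanned for OPEN regions whose lastHost and regionLocation disagree. The following is only a rough sketch, not HBase code: the names ServerName, META_LOAD and find_stale_location are hypothetical, the regex is written against the log format shown in step 3, and treating "lastHost differs from regionLocation" as suspicious for an OPEN region is a heuristic assumption.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ServerName(NamedTuple):
    """A server name in the 'host,port,startcode' form seen in the logs."""
    host: str
    port: int
    startcode: int  # distinguishes restarts of the same host:port

    @classmethod
    def parse(cls, text: str) -> "ServerName":
        host, port, startcode = text.split(",")
        return cls(host, int(port), int(startcode))


# Matches the RegionStateStore line logged when the master loads hbase:meta.
META_LOAD = re.compile(
    r"Load hbase:meta entry region=(?P<region>[0-9a-f]+), "
    r"regionState=(?P<state>\w+), "
    r"lastHost=(?P<last_host>[^,]+,\d+,\d+), "
    r"regionLocation=(?P<location>[^,]+,\d+,\d+)"
)


def find_stale_location(
    line: str,
) -> Optional[Tuple[str, ServerName, ServerName]]:
    """Return (region, lastHost, regionLocation) when an OPEN region's
    two server fields disagree, otherwise None."""
    match = META_LOAD.search(line)
    if match is None:
        return None
    last_host = ServerName.parse(match["last_host"])
    location = ServerName.parse(match["location"])
    if match["state"] == "OPEN" and last_host != location:
        return match["region"], last_host, location
    return None


if __name__ == "__main__":
    # The log line from step 3.
    line = (
        "2018-07-09 20:06:17,985 INFO  [master/e010125048016:60000] "
        "assignment.RegionStateStore: Load hbase:meta entry "
        "region=4423e4182457c5b573729be4682cc3a3, regionState=OPEN, "
        "lastHost=e010125049164.bja,60020,1531136465378, "
        "regionLocation=e010125048153.bja,60020,1531137365840, "
        "openSeqNum=16024"
    )
    print(find_stale_location(line))
```

      For the line from step 3, the function returns the region together with both server names, since lastHost is the live server that later reported OPEN in step 4 while regionLocation is the server that crashed in step 1.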
      

      Attachments

        1. log.zip
          348 kB
          Allan Yang

            People

              allan163 Allan Yang