HBase / HBASE-20796

STUCK RIT though region successfully assigned (hung RPC)

Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 3.0.0-beta-2
    • Component/s: amv2
    • Labels: None

    Description

      This is a good one. We keep logging messages like this:

      2018-06-26 12:32:24,859 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=vd0410.X.Y.com,22101,1529611445046, table=IntegrationTestBigLinkedList_20180525080406, region=e10b35d49528e2453a04c7038e3393d7
      

      ...though the region is successfully assigned.

      Story:

      • Dispatch an assign:
        2018-06-26 12:31:27,390 INFO org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Dispatch pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046
      • It gets stuck (because the server was killed):
        2018-06-26 12:32:29,860 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046, table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2
      • We stay STUCK for a while.
      • The Master notices that the server has crashed and starts a ServerCrashProcedure (SCP).
      • The SCP kills the ongoing assign:
        2018-06-26 12:32:54,809 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=371105 found RIT pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046
      • The kill brings on a retry:
        2018-06-26 12:32:54,810 WARN org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote call failed pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046; exception=ServerCrashProcedure pid=371105, server=vd0410.X.Y.Z,22101,1529611445046
      • The retry eventually succeeds, and the region is successfully deployed to a new server:
        2018-06-26 12:32:55,429 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=370829, ppid=370391, state=SUCCESS; AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2 in 1mins, 35.379sec
      • But then it looks like the original RPC was still outstanding, and it broke in the following way (notice that the region's rit is OPEN and the procedure state is SUCCESS):
        2018-06-26 12:33:06,378 WARN org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote call failed pid=370829, ppid=370391, state=SUCCESS; AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN, location=vc0614.halxg.cloudera.com,22101,1529611443424; exception=Call to vd0410.X.Y.Z/10.10.10.10:22101 failed on local exception: org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
      • Then it says:
        2018-06-26 12:33:06,380 INFO org.apache.hadoop.hbase.master.assignment.AssignProcedure: Retry=1 of max=10; pid=370829, ppid=370391, state=SUCCESS; AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN, location=vc0614.X.Y.Z,22101,1529611443424
      • And finally:
        2018-06-26 12:34:10,727 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=null, table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2
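
      The tell-tale sign in the story above is a "Remote call failed" line logged for a pid whose procedure state is already SUCCESS. Below is a minimal sketch for spotting such lines in a Master log. The class name, method name, and regular expression are assumptions made for illustration, derived only from the log format in the excerpts above. This is not an HBase tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds pids that logged a remote-call failure after reaching SUCCESS. */
public class StaleRemoteCallFinder {
    // Assumed format, taken from the excerpts above:
    // "... Remote call failed pid=<n>, ppid=<n>, state=<STATE>; AssignProcedure ..."
    private static final Pattern REMOTE_CALL_FAILED =
        Pattern.compile("Remote call failed pid=(\\d+),.*?state=([A-Z_:]+);");

    /** Returns the pid of every "Remote call failed" line whose procedure state is SUCCESS. */
    public static List<Long> findStalePids(List<String> logLines) {
        List<Long> pids = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = REMOTE_CALL_FAILED.matcher(line);
            // A failure reported while state=SUCCESS means the callback arrived late.
            if (m.find() && m.group(2).equals("SUCCESS")) {
                pids.add(Long.parseLong(m.group(1)));
            }
        }
        return pids;
    }
}
```

      Fed the two "Remote call failed" lines from the story, this would flag only the second one (12:33:06,378), since the first is logged while the procedure is still in RUNNABLE:REGION_TRANSITION_DISPATCH.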

      Restarting the Master got rid of the STUCK complaints.

      This is interesting because the stuck RPC and the successful reassign are both riding on the same pid.
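
      To picture the hazard: because both RPC attempts share one pid, a failure callback from the first, hung RPC can land on a procedure that has already finished through the retry. The following is a minimal sketch of one possible guard, which tags each dispatch with an attempt number and ignores callbacks from superseded attempts or finished procedures. All names here are hypothetical. This is not HBase's procedure API and is not claimed to be what the attached patch does.

```java
/** Hypothetical model of a procedure that drops late failure callbacks. */
public class AttemptGuardedProcedure {
    public enum State { RUNNABLE, SUCCESS }

    private State state = State.RUNNABLE;
    private int attempt = 0;   // incremented on every dispatch
    private int retries = 0;   // retries actually scheduled

    /** Dispatches a new remote call and returns the attempt number it carries. */
    public synchronized int dispatch() {
        return ++attempt;
    }

    /** Marks the procedure as finished. */
    public synchronized void finish() {
        state = State.SUCCESS;
    }

    /**
     * Failure callback for the remote call tagged with callAttempt.
     * Returns true if a retry was scheduled, false if the callback was ignored.
     */
    public synchronized boolean remoteCallFailed(int callAttempt) {
        // Ignore failures from a finished procedure or from a superseded attempt.
        if (state == State.SUCCESS || callAttempt != attempt) {
            return false;
        }
        retries++;
        return true;
    }

    public synchronized int getRetries() {
        return retries;
    }
}
```

      In this model, the "Connection reset by peer" failure at 12:33:06 would carry the first attempt's number and be dropped, rather than triggering "Retry=1 of max=10" on a procedure that is already in SUCCESS.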

      Attachments

        1. HBASE-20796.branch-2.0.001.patch
          10 kB
          Michael Stack
        2. 0001-Test.patch
          17 kB
          Michael Stack

            People

              Assignee: Michael Stack (stack)
              Reporter: Michael Stack (stack)