HBASE-20137: TestRSGroups is flakey

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0.0-beta-2
    • Fix Version/s: None
    • Component/s: test
    • Labels: None

    Description

      TestRSGroups was the only test that failed the hbase-2 nightlies in #440, at the hadoop2 stage.

      The failure manifests as a timeout. The cause is interesting: it calls into question some of the clauses in UnassignProcedure#remoteCallFailed.

      We are running a disable table concurrently with a shutdown. pid=309 is the disable; pid=311 is the interesting one. The log below is a little hard to read (the exception 'message' is the current procedure as a String, which is hard to parse; fixing), but we are trying to unassign as part of the disable table. Our RPC fails because the server we are trying to RPC to is currently being processed as crashed (pid=308 is a ServerCrashProcedure for this server). As part of processing the failed RPC we expire the server: if we can't RPC to it, it must be gone. The current procedure is then suspended until it gets woken up by the ServerCrashProcedure triggered by the expire. Only in this case we are shutting down, so the expire is ignored and the current procedure is left in its suspended state. This prevents the Master from going down, so we time out.

      2018-03-05 11:29:22,507 INFO [PEWorker-13] assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524
      2018-03-05 11:29:22,508 WARN [PEWorker-13] assignment.RegionTransitionProcedure(187): Remote call failed pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
      2018-03-05 11:29:22,508 WARN [PEWorker-13] assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524, exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException: pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
      2018-03-05 11:29:22,508 WARN [PEWorker-13] master.ServerManager(580): Expiration of 1cfd208ff882,40584,1520249102524 but server shutdown already in progress

      I need to cater for the case where the server expiration is rejected.
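
To make the stuck state concrete, here is a minimal sketch of the decision described above. It is not the HBase code and not the attached patch: ToyServerManager, Outcome, onRemoteCallFailed and the boolean return from expireServer are all hypothetical names and contracts invented for illustration. The only idea it shows is that the caller could check whether the expiration was accepted before suspending, since a rejected expiration means nothing will ever wake the procedure.

```java
public class ExpireRejectedSketch {

    /** What the unassign does after its remote call fails (hypothetical). */
    public enum Outcome { SUSPEND_UNTIL_CRASH_HANDLED, FAIL_WITHOUT_WAITING }

    /** Hypothetical stand-in for the master's server manager; not the HBase class. */
    public static class ToyServerManager {
        private volatile boolean shutdownInProgress = false;

        public void beginShutdown() {
            shutdownInProgress = true;
        }

        /**
         * Assumed contract: returns true only when the expiration was accepted,
         * i.e. crash handling will run and later wake any waiting procedure.
         */
        public boolean expireServer(String serverName) {
            if (shutdownInProgress) {
                System.out.println("Expiration of " + serverName
                    + " rejected: shutdown in progress");
                return false;
            }
            return true;
        }
    }

    /** Suspend only if something is guaranteed to wake us up again. */
    public static Outcome onRemoteCallFailed(ToyServerManager sm, String serverName) {
        if (sm.expireServer(serverName)) {
            return Outcome.SUSPEND_UNTIL_CRASH_HANDLED;
        }
        // Expiration was rejected, so no crash handling will wake this
        // procedure; suspending here would block shutdown indefinitely.
        return Outcome.FAIL_WITHOUT_WAITING;
    }

    public static void main(String[] args) {
        String server = "1cfd208ff882,40584,1520249102524";
        ToyServerManager sm = new ToyServerManager();
        System.out.println(onRemoteCallFailed(sm, server)); // normal operation
        sm.beginShutdown();
        System.out.println(onRemoteCallFailed(sm, server)); // during shutdown
    }
}
```

Whether the right reaction to a rejected expiration is to fail the procedure, retry, or something else is a design question this sketch does not settle; it only separates the two cases.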

      Attachments

        1. HBASE-20137.branch-2.001.patch
          10 kB
          Michael Stack
        2. HBASE-20137.branch-2.002.patch
          13 kB
          Michael Stack
        3. HBASE-20137.branch-2.003.patch
          13 kB
          Michael Stack
        4. HBASE-20137.branch-2.003.patch
          13 kB
          Michael Stack

            People

              Assignee: Michael Stack (stack)
              Reporter: Michael Stack (stack)