Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-19998

Flakey TestVisibilityLabelsWithDefaultVisLabelService

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.0.0-beta-2, 2.0.0
    • test
    • None

    Description

      This is a good one. Its a timeout and though it has lots of test methods, the problem is one of them gets stuck. The test method kills a RegionServers then starts a new one. Usually all works out fine but the odd time there is this unexplained MOVE that gets interjected just as ServerCrashProcedure starts up. hbase:meta gets stuck (perhaps this is what is being referred to here: https://issues.apache.org/jira/browse/HBASE-19929?focusedCommentId=16356906&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16356906). It is trying to run the MOVE by first unassigning from the server that has just crashed. It never succeeds. Need to fix this. Need to figure where these Move operations are coming from too. Let me add some debug.

      See here how we are well into ServerCrashProcedure... and then two MOVEs cut-in... for hbase:meta and for namespace:

      ....
      2018-02-14 02:35:19,806 DEBUG [PEWorker-6] procedure.ServerCrashProcedure(192): pid=10, state=RUNNABLE:SERVER_CRASH_PROCESS_META; ServerCrashProcedure server=asf903.gq1.ygridcore.net,59608,1518575711969, splitWal=true, meta=true; Processing hbase:meta that was on asf903.gq1.ygridcore.net,59608,1518575711969
      2018-02-14 02:35:19,807 INFO  [PEWorker-6] procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=12, ppid=10, state=RUNNABLE:RECOVER_META_SPLIT_LOGS; RecoverMetaProcedure failedMetaServer=asf903.gq1.ygridcore.net,59608,1518575711969, splitWal=true}]
      2018-02-14 02:35:19,811 DEBUG [Thread-214] procedure2.ProcedureExecutor(868): Stored pid=11, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:meta,,1.1588230740, source=asf903.gq1.ygridcore.net,59608,1518575711969, destination=
      2018-02-14 02:35:19,813 INFO  [PEWorker-8] procedure.MasterProcedureScheduler(813): pid=11, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:meta,,1.1588230740, source=asf903.gq1.ygridcore.net,59608,1518575711969, destination= hbase:meta hbase:meta,,1.1588230740
      2018-02-14 02:35:19,814 INFO  [PEWorker-8] procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969}]
      2018-02-14 02:35:19,831 DEBUG [Thread-214] procedure2.ProcedureExecutor(868): Stored pid=13, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15., source=asf903.gq1.ygridcore.net,59608,1518575711969, destination=
      2018-02-14 02:35:19,833 INFO  [PEWorker-10] procedure.MasterProcedureScheduler(813): pid=13, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15., source=asf903.gq1.ygridcore.net,59608,1518575711969, destination= hbase:namespace hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15.
      2018-02-14 02:35:19,837 INFO  [PEWorker-10] procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=15, ppid=13, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:namespace, region=e52a160b3f3a57ab50d710eba62d9b15, server=asf903.gq1.ygridcore.net,59608,1518575711969}]
      ....
      

      Here is the failure of the unassign:

      2018-02-14 02:35:19,944 WARN  [PEWorker-8] assignment.RegionTransitionProcedure(187): Remote call failed pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969; rit=CLOSING, location=asf903.gq1.ygridcore.net,59608,1518575711969; exception=pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969 to asf903.gq1.ygridcore.net,59608,1518575711969
      2018-02-14 02:35:19,945 WARN  [PEWorker-8] assignment.UnassignProcedure(245): Expiring server pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969; rit=CLOSING, location=asf903.gq1.ygridcore.net,59608,1518575711969, exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException: pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969 to asf903.gq1.ygridcore.net,59608,1518575711969
      2018-02-14 02:35:19,945 WARN  [PEWorker-8] master.ServerManager(579): Expiration of asf903.gq1.ygridcore.net,59608,1518575711969 but server shutdown already in progress
      2018-02-14 02:35:19,949 DEBUG [RS-EventLoopGroup-4-22] ipc.FailedServers(56): Added failed server with address asf903.gq1.ygridcore.net/67.195.81.147:59608 to list caused by org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: asf903.gq1.ygridcore.net/67.195.81.147:59608
      2018-02-14 02:35:20,064 DEBUG [MultiHConnection-shared--pool8-t5] ipc.AbstractRpcClient(349): Not trying to connect to asf903.gq1.ygridcore.net/67.195.81.147:59608 this server is in the failed servers list
      ...
      

      ... and we keep trying:

      2018-02-14 02:35:22,093 INFO  [MultiHConnection-shared--pool8-t5] client.AsyncRequestFutureImpl(767): id=6, table=hbase:meta, attempt=6/46 failed=1ops, last exception=java.net.ConnectException: Call to asf903.gq1.ygridcore.net/67.195.81.147:59608 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: asf903.gq1.ygridcore.net/67.195.81.147:59608 on asf903.gq1.ygridcore.net,59608,1518575711969, tracking started null, retrying after=2000ms, replay=1ops
      2018-02-14 02:35:24,095 DEBUG [MultiHConnection-shared--pool8-t5] ipc.NettyRpcConnection(254): Connecting to asf903.gq1.ygridcore.net/67.195.81.147:59608
      2018-02-14 02:35:24,110 DEBUG [RS-EventLoopGroup-4-24] ipc.FailedServers(56): Added failed server with address asf903.gq1.ygridcore.net/67.195.81.147:59608 to list caused by org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: asf903.gq1.ygridcore.net/67.195.81.147:59608
      2018-02-14 02:35:24,112 INFO  [MultiHConnection-shared--pool8-t5] client.AsyncRequestFutureImpl(767): id=6, table=hbase:meta, attempt=7/46 failed=1ops, last exception=java.net.ConnectException: Call to asf903.gq1.ygridcore.net/67.195.81.147:59608 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: asf903.gq1.ygridcore.net/67.195.81.147:59608 on asf903.gq1.ygridcore.net,59608,1518575711969, tracking started null, retrying after=4005ms, replay=1ops
      2018-02-14 02:35:28,120 DEBUG [MultiHConnection-shared--pool8-t5] ipc.NettyRpcConnection(254): Connecting to asf903.gq1.ygridcore.net/67.195.81.147:59608
      

      Test set: org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
      -------------------------------------------------------------------------------
      Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 600.83 s <<< FAILURE! - in org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
      org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService Time elapsed: 583.188 s <<< ERROR!
      org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
      at org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService.testAddVisibilityLabelsOnRSRestart(TestVisibilityLabelsWithDefaultVisLabelService.java:126)

      org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService Time elapsed: 583.227 s <<< ERROR!
      java.lang.Exception: Appears to be stuck in thread RS-EventLoopGroup-7-2

      Attachments

        1. hbase-19988.master.001.patch
          2 kB
          Michael Stack
        2. HBASE-19998.master.001.patch
          8 kB
          Michael Stack

        Issue Links

          Activity

            People

              stack Michael Stack
              stack Michael Stack
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: