Ratis / RATIS-1072

Should not shut down and re-create the channel/stub in GrpcServerProtocolClient when StreamObserver::onError() is called.


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: gRPC

    Description


      When InstallSnapshotResponseHandler::onError() or AppendLogResponseHandler::onError() is called, it will:

      1. call rpcService.getProxies().resetProxy(getFollowerId()).
      2. set appendLogRequestObserver to be null.

      The recovery steps for this BiDi streaming call are actually:

      1. call ManagedChannel.shutdown()
      2. re-create the channel from NettyChannelBuilder
      3. re-create the async stub from the channel.
      4. re-create a StreamObserver<OurRequest> from the async stub.
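      The cost of this recovery path can be sketched with a toy model (the class and counter names below are illustrative, not the actual Ratis/gRPC API); it simply counts how often the channel gets rebuilt:

```java
// Toy model of the current recovery path (class and method names are
// illustrative, not the actual Ratis/gRPC API): every onError() tears
// down the shared channel and rebuilds channel, stub, and observer.
import java.util.concurrent.atomic.AtomicInteger;

class CostlyRecoveryModel {
    static final AtomicInteger channelsCreated = new AtomicInteger();

    // stands in for NettyChannelBuilder...build(), a heavyweight operation
    static Object newChannel() {
        channelsCreated.incrementAndGet();
        return new Object();
    }

    static Object channel = newChannel();

    static void onError() {
        channel = null;          // step 1: ManagedChannel.shutdown()
        channel = newChannel();  // steps 2-4: rebuild channel, stub, observer
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            onError();           // ten RPC failures
        }
        System.out.println(channelsCreated.get()); // 11: 1 initial + 10 rebuilds
    }
}
```

      Each failed RPC pays the full channel-construction cost, even though the underlying connection was never the problem.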

       

      The above steps are both unnecessary and dangerous.

       

      It is unnecessary:

      According to the [reply from gRPC|https://github.com/grpc/grpc-java/issues/7442]:

      Yes, you don't need to recreate the channel (and stub) for starting a new RPC call. The overhead for creating a channel is big.

      Receiving StreamObserver.onError() only indicates that specific RPC has been terminated. The connection should be still there. So you can definitely start new RPCs on the existing connection.

       

      It is dangerous:

      In a multi-raft configuration, if two leaders sit in DN1 and their followers sit in DN2, their streaming calls share the managed channel of the same GrpcServerProtocolClient targeting DN2 from DN1.

      If leader1's call hits StreamObserver::onError(), it will shut down and re-create the underlying channel. At the same time, leader2's streaming call is affected and hits StreamObserver::onError() as well, shutting down and re-creating the channel again. The consequence is that the channel is continually shut down and re-created.
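      This ping-pong can be simulated with a small toy model (names are illustrative, not the actual Ratis/gRPC API): two leaders' streams share one channel, and shutting the channel down on behalf of one stream cancels the other, which then fires its own onError():

```java
// Toy simulation of the shared-channel failure mode (names are
// illustrative): two leaders' streams share one channel; when either
// stream's onError() shuts the channel down, the peer's in-flight
// stream is cancelled, which triggers the peer's onError() in turn.
import java.util.concurrent.atomic.AtomicInteger;

class SharedChannelPingPong {
    static final AtomicInteger shutdowns = new AtomicInteger();

    static void onError(int leader, int remainingRounds) {
        if (remainingRounds == 0) {
            return;                        // bounded here; unbounded in practice
        }
        shutdowns.incrementAndGet();       // shut down + re-create the shared channel
        // tearing down the channel cancels the other leader's stream,
        // which fires that leader's onError() as well
        onError(1 - leader, remainingRounds - 1);
    }

    public static void main(String[] args) {
        onError(0, 6);                     // one initial failure on leader 0
        System.out.println(shutdowns.get()); // 6 channel tear-downs from one error
    }
}
```

      A single transient failure on one stream is thus amplified into a continuous cycle of channel tear-downs, which matches the 3-second shutdown loop in the traces below.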

       

      Trace from the DN hosting the leaders:

      We can see that the channel targeting 172.16.90.69 is continually shut down and re-created every 3 seconds.

      2020-09-18 19:47:04,025 [grpc-default-executor-93] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134495, target=172.16.90.69:9858}}.
      2020-09-18 19:47:07,025 [grpc-default-executor-115] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134499, target=172.16.90.69:9858}}.
      2020-09-18 19:47:10,026 [grpc-default-executor-106] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134503, target=172.16.90.69:9858}}.
      2020-09-18 19:47:13,026 [grpc-default-executor-115] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134507, target=172.16.90.69:9858}}.
      2020-09-18 19:47:16,027 [grpc-default-executor-106] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134511, target=172.16.90.69:9858}}.
      2020-09-18 19:47:19,027 [grpc-default-executor-114] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134532, target=172.16.90.69:9858}}.
      2020-09-18 19:47:22,027 [grpc-default-executor-105] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134540, target=172.16.90.69:9858}}.
      2020-09-18 19:47:25,028 [grpc-default-executor-104] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134544, target=172.16.90.69:9858}}.
      2020-09-18 19:47:28,028 [grpc-default-executor-117] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134548, target=172.16.90.69:9858}}.
      2020-09-18 19:47:31,028 [grpc-default-executor-114] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134552, target=172.16.90.69:9858}}.
       

      Trace from the DN hosting the followers:

      Streams of different raft groups are continually affected. The (term, index) pairs (t:9, i:8713), (t:7, i:12312), (t:3, i:213868), (t:10, i:12504) indicate that the streams belong to different raft groups, while the shared raft peer id df173506-6978-47e5-8aef-034cafd221a6 shows they all come from leaders on the same DN.

      2020-09-18 19:51:49,498 [grpc-default-executor-113] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#320-t9, previous=(t:9, i:8713), leaderCommit=8714, initializing? false, entries: size=1, first=(t:9, i:8714), METADATAENTRY(c:8713): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
      2020-09-18 19:51:52,499 [grpc-default-executor-104] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#608-t7, previous=(t:7, i:12312), leaderCommit=12313, initializing? false, entries: size=1, first=(t:7, i:12313), METADATAENTRY(c:12312): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
      2020-09-18 19:51:52,499 [grpc-default-executor-109] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#1250591-t3, previous=(t:3, i:213868), leaderCommit=213869, initializing? false, entries: size=1, first=(t:3, i:213869), METADATAENTRY(c:213865): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
      2020-09-18 19:51:52,499 [grpc-default-executor-112] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#601-t10, previous=(t:10, i:12504), leaderCommit=12505, initializing? false, entries: size=1, first=(t:10, i:12505), METADATAENTRY(c:12504): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
      2020-09-18 19:51:55,499 [grpc-default-executor-109] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#785-t10, previous=(t:10, i:6660), leaderCommit=6661, initializing? false, entries: size=1, first=(t:10, i:6661), METADATAENTRY(c:6660): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
      2020-09-18 19:51:55,499 [grpc-default-executor-104] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#701-t6, previous=(t:6, i:9305), leaderCommit=9306, initializing? false, entries: size=1, first=(t:6, i:9306), METADATAENTRY(c:9305): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
      2020-09-18 19:51:58,499 [grpc-default-executor-114] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#611-t7, previous=(t:7, i:12312), leaderCommit=12313, initializing? false, entries: size=1, first=(t:7, i:12313), METADATAENTRY(c:12312): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
      2020-09-18 19:51:58,499 [grpc-default-executor-112] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#604-t10, previous=(t:10, i:12504), leaderCommit=12505, initializing? false, entries: size=1, first=(t:10, i:12505), METADATAENTRY(c:12504): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
      2020-09-18 19:51:58,499 [grpc-default-executor-113] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#324-t9, previous=(t:9, i:8713), leaderCommit=8714, initializing? false, entries: size=1, first=(t:9, i:8714), METADATAENTRY(c:8713): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
       

      The fix is simple:

      We can reuse the channel and stub; only step 4, re-creating the StreamObserver, is needed.
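      The same toy model, adjusted for the proposed recovery (names are illustrative, not the actual Ratis/gRPC API), shows the channel being built exactly once no matter how many RPCs fail:

```java
// Sketch of the proposed recovery (illustrative names): keep the channel
// and stub, and on error re-create only the per-stream observer.
import java.util.concurrent.atomic.AtomicInteger;

class ReuseChannelModel {
    static final AtomicInteger channelsCreated = new AtomicInteger();
    static final AtomicInteger observersCreated = new AtomicInteger();

    static final Object channel = newChannel();  // built once, reused for every stream
    static Object observer = newObserver();

    static Object newChannel() { channelsCreated.incrementAndGet(); return new Object(); }
    static Object newObserver() { observersCreated.incrementAndGet(); return new Object(); }

    static void onError() {
        // only step 4: a fresh StreamObserver on the same channel and stub
        observer = newObserver();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            onError();                           // ten RPC failures
        }
        System.out.println(channelsCreated.get());   // 1: never rebuilt
        System.out.println(observersCreated.get());  // 11: 1 initial + 10 fresh streams
    }
}
```

      Because the channel is never torn down, a failure on one leader's stream can no longer cancel the streams of other raft groups sharing the same GrpcServerProtocolClient.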

       


      People

        Assignee: Glen Geng
        Reporter: Glen Geng


      Time Tracking

        Original Estimate: Not Specified
        Remaining Estimate: 0h
        Time Spent: 2h 50m