Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
Description
When InstallSnapshotResponseHandler::onError() or AppendLogResponseHandler::onError() is called, it will:
- call rpcService.getProxies().resetProxy(getFollowerId()).
- set appendLogRequestObserver to be null.
The recovery steps for this BiDi streaming call is actually:
- call ManagedChannel.shutdown()
- re-create the channel from NettyChannelBuilder
- re-create the async stub from the channel.
- re-create a StreamObserver<OurRequest> from the async stub.
Above steps is un-necessary and dangerous.
It is un-necessary:
according to the [reply from grpc|https://github.com/grpc/grpc-java/issues/7442]]
Yes, you don't need to recreate the channel (and stub) for starting a new RPC call. The overhead for creating a channel is big.
Receiving StreamObserver.onError() only indicates that specific RPC has been terminated. The connection should be still there. So you can definitely start new RPCs on the existing connection.
It is dangerous:
In multi-raft configuraiton, if two leaders sit in DN1, their followers sit in DN2, their steaming call will share the managed channel in the same GrpcServerProtocolClient that targeting from DN1 to DN2.
If leader1 calls StreamObserver::onError(), it will shut down and re-create the underlying channel. In the same time, leader2's streaming call will be affected, and call StreamObserver::onError() to shut down and re-create the channel as well. The consequences is, the channel will be continually shutdown and re-create.
trace from DN consisting of the leaders:
We can see that, the channel targetting 172.16.0.69 is continually shutdown and re-create in every 3s.
2020-09-18 19:47:04,025 [grpc-default-executor-93] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134495, target=172.16.90.69:9858}}.
2020-09-18 19:47:07,025 [grpc-default-executor-115] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134499, target=172.16.90.69:9858}}.
2020-09-18 19:47:10,026 [grpc-default-executor-106] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134503, target=172.16.90.69:9858}}.
2020-09-18 19:47:13,026 [grpc-default-executor-115] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134507, target=172.16.90.69:9858}}.
2020-09-18 19:47:16,027 [grpc-default-executor-106] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134511, target=172.16.90.69:9858}}.
2020-09-18 19:47:19,027 [grpc-default-executor-114] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134532, target=172.16.90.69:9858}}.
2020-09-18 19:47:22,027 [grpc-default-executor-105] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134540, target=172.16.90.69:9858}}.
2020-09-18 19:47:25,028 [grpc-default-executor-104] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134544, target=172.16.90.69:9858}}.
2020-09-18 19:47:28,028 [grpc-default-executor-117] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134548, target=172.16.90.69:9858}}.
2020-09-18 19:47:31,028 [grpc-default-executor-114] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper{delegate=ManagedChannelImpl{logId=134552, target=172.16.90.69:9858}}.
trace from DN consisting of the followers:
Streams of different raft-group is continually affected. The (term, index) indicate that they belong to different raft-group, (t:9, i:8713), (t:7, i:12312),(t:3, i:213868),(t:10, i:12504), meanwhile raft peer id df173506-6978-47e5-8aef-034cafd221a6 means they are from leaders on the same DN.
2020-09-18 19:51:49,498 [grpc-default-executor-113] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#320-t9, previous=(t:9, i:8713), leaderCommit=8714, initializing? false, entries: size=1, first=(t:9, i:8714), METADATAENTRY(c:8713): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2020-09-18 19:51:52,499 [grpc-default-executor-104] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#608-t7, previous=(t:7, i:12312), leaderCommit=12313, initializing? false, entries: size=1, first=(t:7, i:12313), METADATAENTRY(c:12312): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2020-09-18 19:51:52,499 [grpc-default-executor-109] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#1250591-t3, previous=(t:3, i:213868), leaderCommit=213869, initializing? false, entries: size=1, first=(t:3, i:213869), METADATAENTRY(c:213865): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2020-09-18 19:51:52,499 [grpc-default-executor-112] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#601-t10, previous=(t:10, i:12504), leaderCommit=12505, initializing? false, entries: size=1, first=(t:10, i:12505), METADATAENTRY(c:12504): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2020-09-18 19:51:55,499 [grpc-default-executor-109] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#785-t10, previous=(t:10, i:6660), leaderCommit=6661, initializing? false, entries: size=1, first=(t:10, i:6661), METADATAENTRY(c:6660): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2020-09-18 19:51:55,499 [grpc-default-executor-104] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#701-t6, previous=(t:6, i:9305), leaderCommit=9306, initializing? false, entries: size=1, first=(t:6, i:9306), METADATAENTRY(c:9305): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2020-09-18 19:51:58,499 [grpc-default-executor-114] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#611-t7, previous=(t:7, i:12312), leaderCommit=12313, initializing? false, entries: size=1, first=(t:7, i:12313), METADATAENTRY(c:12312): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2020-09-18 19:51:58,499 [grpc-default-executor-112] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#604-t10, previous=(t:10, i:12504), leaderCommit=12505, initializing? false, entries: size=1, first=(t:10, i:12505), METADATAENTRY(c:12504): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
2020-09-18 19:51:58,499 [grpc-default-executor-113] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#324-t9, previous=(t:9, i:8713), leaderCommit=8714, initializing? false, entries: size=1, first=(t:9, i:8714), METADATAENTRY(c:8713): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
The fix is simple
We can reuse the the channel and stub, just need step 4 to re-create the StreamObserver.
Attachments
Issue Links
- is duplicated by
-
RATIS-878 Infinite restart of GrpcLogAppender
- Resolved
-
RATIS-1076 Should not shutdown and re-create grpc channel/stub in GrpcClientProtocolClient.
- Resolved
- links to