Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Description
What's the problem?
After hadoop-ozone had been running for 4 days, the datanode leaked memory. A heap dump showed 460710 instances of GrpcLogAppender, but only 6 instances of SenderList, each holding 1 or 2 GrpcLogAppender instances. The logs also contained many messages related to LeaderState::restartSender, such as:
INFO impl.RaftServerImpl: 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: Restarting GrpcLogAppender for 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5->229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a (stderr, 2020-04-06T03:59:53.37892512Z)
So many GrpcLogAppender instances did not stop their daemon threads when they were removed from the senders.
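The following minimal Java sketch shows why appenders that are removed without being stopped pile up. Everything in it is hypothetical: `Appender` stands in for GrpcLogAppender, a plain `List` stands in for SenderList, and `replaceRepeatedly` is an illustrative helper, not a Ratis API. A live thread is a GC root, so an appender whose daemon thread keeps running stays reachable even after it has been removed from the list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model, not Ratis code: Appender stands in for GrpcLogAppender,
// and the "senders" list stands in for SenderList.
class LeakSketch {
    static final class Appender {
        private final AtomicBoolean running = new AtomicBoolean(true);
        final Thread daemon;

        Appender(String name) {
            // The lambda captures this Appender, so while the thread is alive
            // the Appender stays reachable, even after leaving the senders list.
            daemon = new Thread(() -> {
                while (running.get()) {
                    try {
                        Thread.sleep(10);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }, name);
            daemon.setDaemon(true);
            daemon.start();
        }

        /** Signals the thread to exit and waits until it has terminated. */
        void stop() {
            running.set(false);
            daemon.interrupt();
            try {
                daemon.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Replaces the single sender n times; returns how many removed appenders still have a live thread. */
    static long replaceRepeatedly(int n, boolean stopOld) {
        List<Appender> senders = new ArrayList<>();
        List<Appender> removed = new ArrayList<>(); // kept only so the demo can inspect them
        senders.add(new Appender("appender-0"));
        for (int i = 1; i <= n; i++) {
            Appender old = senders.remove(0);
            if (stopOld) {
                old.stop();
            }
            removed.add(old);
            senders.add(new Appender("appender-" + i));
        }
        long alive = removed.stream().filter(a -> a.daemon.isAlive()).count();
        // Clean up so the demo itself does not leave threads running.
        removed.forEach(Appender::stop);
        senders.forEach(Appender::stop);
        return alive;
    }

    public static void main(String[] args) {
        System.out.println("live threads, removed without stop: " + replaceRepeatedly(5, false));
        System.out.println("live threads, removed with stop:    " + replaceRepeatedly(5, true));
    }
}
```

Running `main` prints 5 for the first case and 0 for the second. Only the variant that stops the old appender before replacing it leaves no live threads behind.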
Why is LeaderState::restartSender called so many times?
1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and GrpcLogAppender throws an exception when it finds that SegmentedRaftLog is closed. GrpcLogAppender is then restarted. The new GrpcLogAppender throws the same exception because SegmentedRaftLog is still closed, so it is restarted again, and so on. The result is an infinite restart loop of GrpcLogAppender.
2. When a group is removed, GrpcLogAppender is in fact stopped first (RaftServerImpl::shutdown -> RoleInfo::shutdownLeaderState -> LeaderState::stop -> LogAppender::stopAppender), and SegmentedRaftLog is closed afterwards (RaftServerImpl::shutdown -> ServerState::close -> ...). RoleInfo::shutdownLeaderState is called before ServerState::close, but GrpcLogAppender is stopped asynchronously. So whenever a GrpcLogAppender actually stops only after SegmentedRaftLog has been closed, the infinite restart described in item 1 occurs.
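The restart loop in item 1 can be modeled without threads. In this sketch, all names are assumptions: `RaftLog`, `Leader`, and `restartSender` are simplified stand-ins, not the real SegmentedRaftLog or LeaderState. Every new appender first reads a log that is already closed, fails, and asks the leader for a replacement. The optional guard skips the restart when the leader is no longer running, which is the idea stated in the RATIS-1761 title. The actual Ratis change may differ. `maxRuns` only caps the loop so the demo terminates.

```java
// Hypothetical, single-threaded model of the restart loop; not Ratis code.
class RestartLoopSketch {
    /** Stand-in for SegmentedRaftLog: every read fails once the log is closed. */
    static final class RaftLog {
        private boolean closed;

        void close() {
            closed = true;
        }

        long nextIndex() {
            if (closed) {
                throw new IllegalStateException("log is closed");
            }
            return 1;
        }
    }

    /** Stand-in for LeaderState; the optional guard follows the idea in the RATIS-1761 title. */
    static final class Leader {
        private final boolean guardEnabled;
        boolean running = true;
        int restarts;

        Leader(boolean guardEnabled) {
            this.guardEnabled = guardEnabled;
        }

        /** Returns true if a replacement appender was started. */
        boolean restartSender() {
            if (guardEnabled && !running) {
                return false; // leader is shutting down: do not start a new appender
            }
            restarts++;
            return true;
        }
    }

    /** Runs appenders against a closed log; maxRuns caps the otherwise endless loop. */
    static int simulate(boolean guardEnabled, int maxRuns) {
        RaftLog log = new RaftLog();
        Leader leader = new Leader(guardEnabled);
        leader.running = false; // group removal has begun...
        log.close();            // ...and the log is already closed
        boolean replaced = true;
        for (int run = 0; run < maxRuns && replaced; run++) {
            try {
                log.nextIndex();   // each new appender reads the log first
                replaced = false;  // log readable, no restart needed (not reached: the log stays closed)
            } catch (IllegalStateException e) {
                replaced = leader.restartSender(); // the failure triggers a restart
            }
        }
        return leader.restarts;
    }

    public static void main(String[] args) {
        System.out.println("restarts without guard: " + simulate(false, 1000));
        System.out.println("restarts with guard:    " + simulate(true, 1000));
    }
}
```

Without the guard, the restart count reaches the cap (1000 here). With the guard, it stays at 0.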
For more details, see RATIS-840.
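The ordering problem in item 2 comes down to the stop call returning before the appender thread has exited. The sketch below makes that interleaving deterministic with two `CountDownLatch` test hooks. These hooks are hypothetical and have no counterpart in Ratis. The sketch compares an asynchronous stop followed immediately by closing the log with a variant that joins the appender thread before closing the log. Joining is a swapped-in technique used only to expose the ordering dependency. It is not the fix adopted in Ratis, and a real implementation would also need to bound the wait in case the appender is blocked.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of the shutdown ordering, not Ratis code. The latches are test
// hooks that guarantee the appender is inside an iteration when stop is requested.
class ShutdownOrderSketch {
    static final class RaftLog {
        private volatile boolean closed;

        void close() {
            closed = true;
        }

        long nextIndex() {
            if (closed) {
                throw new IllegalStateException("log is closed");
            }
            return 1;
        }
    }

    static final class Appender {
        private final RaftLog log;
        private final CountDownLatch entered = new CountDownLatch(1);
        private final CountDownLatch proceed = new CountDownLatch(1);
        private final AtomicBoolean running = new AtomicBoolean(true);
        final AtomicReference<Throwable> failure = new AtomicReference<>();
        final Thread thread;

        Appender(RaftLog log) {
            this.log = log;
            this.thread = new Thread(this::loop, "appender");
            thread.setDaemon(true);
            thread.start();
        }

        private void loop() {
            try {
                while (running.get()) {
                    entered.countDown(); // now inside an iteration
                    proceed.await();     // the stop request and log close can happen here
                    log.nextIndex();     // reads the log even though stop was already requested
                }
            } catch (IllegalStateException e) {
                failure.set(e); // in the report, this failure leads to LeaderState::restartSender
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        /** Asynchronous stop: only flips a flag and returns without waiting. */
        void stopAsync() {
            running.set(false);
        }
    }

    /** Returns true if the appender hit the closed log during shutdown. */
    static boolean shutDown(boolean waitForAppender) {
        RaftLog log = new RaftLog();
        Appender appender = new Appender(log);
        try {
            appender.entered.await();
            appender.stopAsync();             // 1. stop the appender (asynchronously)
            if (waitForAppender) {
                appender.proceed.countDown();
                appender.thread.join();       // wait for the appender thread to exit...
                log.close();                  // ...then close the log
            } else {
                log.close();                  // 2. close the log while the appender is mid-iteration
                appender.proceed.countDown();
                appender.thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return appender.failure.get() != null;
    }

    public static void main(String[] args) {
        System.out.println("async stop, then close:    appender failed = " + shutDown(false));
        System.out.println("stop and join, then close: appender failed = " + shutDown(true));
    }
}
```

`shutDown(false)` returns true because the appender hits the closed log, which is where the restart in item 1 would begin. `shutDown(true)` returns false.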
Attachments
Issue Links
- duplicates
  RATIS-1761 If LeaderStateImpl is not running, it should not restart a LogAppender. (Resolved)
  RATIS-1072 Should not shutdown and re-create channel/stub in GrpcServerProtocolClient when StreamObserver::onError() is called. (Resolved)
- is a child of
  RATIS-840 Memory leak of LogAppender (Resolved)