In our large, busy YARN cluster, which deploys the Spark external shuffle service as a YARN NodeManager (NM) auxiliary service, we encountered OOM errors in some NMs.
After dumping the heap, I found that some StreamState objects were still in memory even though the applications they belonged to had already finished.
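To illustrate the symptom, here is a minimal sketch of how per-stream state can outlive its application when cleanup is tied only to stream consumption or channel termination. It is a simplified model, not Spark's actual code and not a confirmed root cause: the names StreamStateLeakSketch, SketchStreamManager, SketchStreamState, streamConsumed, and liveStreams are hypothetical, and channels are modeled as plain strings.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified model for illustration only; NOT Spark's OneForOneStreamManager.
class StreamStateLeakSketch {

    // Stand-in for a per-stream state entry: owning app, owning channel, buffered data.
    static final class SketchStreamState {
        final String appId;
        final String channelId; // assumption: a channel is identified by a string here
        final byte[] buffers;

        SketchStreamState(String appId, String channelId, byte[] buffers) {
            this.appId = appId;
            this.channelId = channelId;
            this.buffers = buffers;
        }
    }

    static final class SketchStreamManager {
        private final AtomicLong nextStreamId = new AtomicLong();
        private final Map<Long, SketchStreamState> streams = new ConcurrentHashMap<>();

        // Registers a stream; the entry stays in the map until one of the two
        // removal paths below is invoked.
        long registerStream(String appId, String channelId, int bufferBytes) {
            long id = nextStreamId.getAndIncrement();
            streams.put(id, new SketchStreamState(appId, channelId, new byte[bufferBytes]));
            return id;
        }

        // Removal path 1: the client fetched every chunk of the stream.
        void streamConsumed(long streamId) {
            streams.remove(streamId);
        }

        // Removal path 2: the channel that owns the stream was closed.
        void connectionTerminated(String channelId) {
            streams.values().removeIf(s -> s.channelId.equals(channelId));
        }

        int liveStreams() {
            return streams.size();
        }
    }

    public static void main(String[] args) {
        SketchStreamManager manager = new SketchStreamManager();
        manager.registerStream("app_1", "channel_1", 1024);

        // The application finishes here, but nothing in this model reacts to that:
        // the stream was never fully consumed and the channel is still open.
        System.out.println("live streams after app finished: " + manager.liveStreams());

        // Only an explicit cleanup for the owning channel releases the entry.
        manager.connectionTerminated("channel_1");
        System.out.println("live streams after channel cleanup: " + manager.liveStreams());
    }
}
```

In this model, an entry is released only when its stream is fully consumed or its channel is terminated, so a finished application alone does not free anything.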
Here are some related figures:
The heap dump below shows that the memory consumption mainly consists of two parts:
(1) OneForOneStreamManager: 4,429,796,424 bytes (77.11%)
(2) PoolChunk: 1,059,201,712 bytes (18.44%)
Digging into the OneForOneStreamManager shows that some StreamState objects still remain:
Incoming references to StreamState::associatedChannel: