Description
When a node tries to reconnect to another node in a scale-down cluster, the other node denies the reconnect request and the first node keeps retrying. The repeated retries cause tasks to accumulate in the ordered executor, which eventually leads to an OOM.
To reproduce:
- Start a 2-node cluster (node1 and node2) configured in scale-down mode.
- Stop node2 and restart it.
- node1 will try to reconnect to node2 repeatedly and never succeed.
- Inspect the connecting ClientSessionFactory (for example, by adding logging): its thread pool (closeExecutor, an OrderedExecutor) keeps adding tasks to its queue.
Over time the queue keeps growing and eventually exhausts the heap memory.
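The growth pattern can be illustrated outside the broker with plain java.util.concurrent classes. This is only a minimal sketch, not the actual ClientSessionFactory or OrderedExecutor code: the class name RetryQueueGrowthSketch and the method pendingAfterRetries are hypothetical, and it assumes that each failed reconnect attempt enqueues one task on an unbounded queue whose tasks are not drained as fast as they are added (modelled here by a single worker that stays blocked).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the behaviour described above; not broker code.
final class RetryQueueGrowthSketch {

    // Returns how many tasks are still waiting in the queue after
    // `retries` simulated reconnect attempts.
    static int pendingAfterRetries(int retries) {
        // Unbounded queue: nothing ever rejects or drops a task.
        LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        ThreadPoolExecutor executor =
                new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, queue);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Assumption: the single worker is busy, so queued tasks cannot drain.
            executor.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Assumption: every denied reconnect attempt submits one more task.
            for (int i = 0; i < retries; i++) {
                executor.execute(() -> { });
            }
            return queue.size();
        } finally {
            release.countDown();
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        for (int retries : new int[] {10, 1_000, 100_000}) {
            System.out.println(retries + " retries -> "
                    + pendingAfterRetries(retries) + " pending tasks");
        }
    }
}
```

In this sketch the number of pending tasks equals the number of retries, so with an endless retry loop the queue has no upper bound and heap usage grows with it.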