We recently had an issue where some bolts got really backed up and started to die from OOMs. The issue ended up being 2 fold.
First the GC really slowed down the worker so much that it could not keep up even with < 1% of the traffic that was still being sent to it. Which made it almost impossible to recover.
The second issue was that the serialization of the tuples took a lot longer than the processing, which resulted in the send queue filling up much more quickly than the receive queue.
To help fix this issue I plan to address this in 2 ways. First we need a better algorithm that can actually shut off the flow entirely to a very slow bolt and second we need to take the send queue into account when shuffling.
This is not a full set of changes needed by
STORM-2686 but it is a step in that direction. I am going to try and set it up so that the two algorithms would work nicely together.