Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.9.1-incubating, 0.10.0, 1.0.0, 2.0.0
Description
Over the weekend we had an incident where ackers were running out of memory at a really scary rate. It turns out that they were seeing a lot of failures, for an unrelated reason, but each of those failures was leaking its tuple tracking entry, because we never send ticks to any system components, and ackers are system components. So the tracking map was never rotated, and the failed tuples were never deleted from the map.
This leak eventually made the ackers crash, and when they came back up the other components kept blasting them with messages that would never be fully acked, which also leaked because of the tick problem.
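To make the failure mode concrete, here is a minimal sketch of a bucketed, time-rotated tracking map. It is not Storm's actual acker code: the class name `RotatingMapSketch`, its methods, and the bucket count are all made up for illustration, and the only assumption carried over from the description above is that entries which are never fully acked can only be dropped by a rotation, and that rotation is driven by ticks.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the acker's tracking map.
// Entries that are never removed explicitly can only leave via rotate().
final class RotatingMapSketch<K, V> {
    // Newest bucket is at the head, oldest at the tail.
    private final Deque<Map<K, V>> buckets = new ArrayDeque<>();

    RotatingMapSketch(int numBuckets) {
        if (numBuckets < 2) {
            throw new IllegalArgumentException("need at least 2 buckets");
        }
        for (int i = 0; i < numBuckets; i++) {
            buckets.addLast(new HashMap<>());
        }
    }

    // New or updated entries always land in the newest bucket.
    void put(K key, V value) {
        for (Map<K, V> bucket : buckets) {
            bucket.remove(key);
        }
        buckets.peekFirst().put(key, value);
    }

    // Normal path: a fully acked tuple tree is removed explicitly.
    V remove(K key) {
        for (Map<K, V> bucket : buckets) {
            if (bucket.containsKey(key)) {
                return bucket.remove(key);
            }
        }
        return null;
    }

    // Meant to run once per tick: drop the oldest bucket and return
    // whatever was still in it (the timed-out entries).
    Map<K, V> rotate() {
        Map<K, V> expired = buckets.removeLast();
        buckets.addFirst(new HashMap<>());
        return expired;
    }

    int size() {
        int total = 0;
        for (Map<K, V> bucket : buckets) {
            total += bucket.size();
        }
        return total;
    }

    public static void main(String[] args) {
        RotatingMapSketch<Long, String> noTicks = new RotatingMapSketch<>(3);
        RotatingMapSketch<Long, String> withTicks = new RotatingMapSketch<>(3);

        // Simulate 1000 tuple trees that fail and are never fully acked.
        for (long id = 0; id < 1000; id++) {
            noTicks.put(id, "pending");
            withTicks.put(id, "pending");
            // Pretend a tick arrives after every 100 tuples.
            if (id % 100 == 99) {
                withTicks.rotate();
            }
        }

        System.out.println("entries without ticks: " + noTicks.size());
        System.out.println("entries with ticks:    " + withTicks.size());
    }
}
```

In this sketch the map that never sees a tick keeps every failed entry, while the ticked map stays bounded by the number of buckets times the entries added per tick interval. That matches the behavior described above: with no ticks delivered to the ackers, nothing ever calls the rotation, so failed tuples accumulate until the process runs out of memory.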
Looking back, this has been in every release since 0.9.1-incubating. It appears to have been introduced by https://github.com/apache/storm/commit/483ce454a3b2cd31b5d1c34e9365346459b358a8
So every Apache release has this problem (which is the only reason I have not marked this as a blocker, because apparently it is not so bad that anyone has noticed in the past 4 years).