Uploaded image for project: 'Apache Storm'
  1. Apache Storm
  2. STORM-2786

Ackers leak tracking info on failure and lots of other cases.

    Details

      Description

      Over the weekend we had an incident where ackers were running out of memory at a really scary rate. It turns out that they were having a lot of failures, for an unrelated reason, but each of the failures were resulting in tuple tracking being lost because...

      We don't send ticks to any system components ever...

      https://github.com/apache/storm/blob/124acb92dff04a57b530ab4d95a698abc8ff46d9/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L384

      and ackers are system components.

      So the tracking map was never rotated and all failed tuples

      https://github.com/apache/storm/blob/124acb92dff04a57b530ab4d95a698abc8ff46d9/storm-client/src/jvm/org/apache/storm/daemon/Acker.java#L97-L103

      Were never deleted from the map.

      This leak eventually made the ackers crash, and when they came back up the other components kept blasting them with messages that would never be fully acked which also leaked because of the tick problem.

      Looking back this has been in every release since 0.9.1-incubating. It appears to have been introduced by https://github.com/apache/storm/commit/483ce454a3b2cd31b5d1c34e9365346459b358a8

      So every apache release has this problem (which is the only reason I have not marked this as a blocker, because apparently it is not so bad that anyone has noticed in the past 4 years).

        Attachments

          Activity

            People

            • Assignee:
              revans2 Robert Joseph Evans
              Reporter:
              revans2 Robert Joseph Evans
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 1h 40m
                1h 40m