Apache Storm / STORM-2786

Ackers leak tracking info on failure and in lots of other cases.

Details

    Description

      Over the weekend we had an incident where ackers were running out of memory at a really scary rate. It turns out that they were seeing a lot of failures, for an unrelated reason, but each failure left its tuple tracking entry behind because...

      We don't send ticks to any system components ever...

      https://github.com/apache/storm/blob/124acb92dff04a57b530ab4d95a698abc8ff46d9/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L384

      and ackers are system components.

      So the tracking map was never rotated, and all failed tuples

      https://github.com/apache/storm/blob/124acb92dff04a57b530ab4d95a698abc8ff46d9/storm-client/src/jvm/org/apache/storm/daemon/Acker.java#L97-L103

      were never deleted from the map.
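
      To make the mechanism concrete, here is a minimal sketch of a bucketed "rotating" map that only expires entries when rotate() is called. It is not Storm's actual class: RotatingMapSketch, pendingAfter and the one-rotation-per-tuple schedule are hypothetical names and simplifications chosen for illustration. The point is only that if the tick that drives rotate() never arrives, every entry that is not explicitly removed stays in memory.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for a rotating map; an illustration, not Storm's implementation. */
final class RotatingMapSketch<K, V> {
    // Newest bucket is at the head of the deque, oldest at the tail.
    private final Deque<Map<K, V>> buckets = new ArrayDeque<>();

    RotatingMapSketch(int numBuckets) {
        if (numBuckets < 2) {
            throw new IllegalArgumentException("need at least 2 buckets");
        }
        for (int i = 0; i < numBuckets; i++) {
            buckets.addLast(new HashMap<>());
        }
    }

    /** Stores the entry in the newest bucket, dropping any older copy of the key. */
    void put(K key, V value) {
        for (Map<K, V> bucket : buckets) {
            bucket.remove(key);
        }
        buckets.peekFirst().put(key, value);
    }

    /** Discards the oldest bucket, adds a fresh newest one, and returns what expired. */
    Map<K, V> rotate() {
        Map<K, V> expired = buckets.removeLast();
        buckets.addFirst(new HashMap<>());
        return expired;
    }

    /** Total number of entries across all buckets. */
    int size() {
        int total = 0;
        for (Map<K, V> bucket : buckets) {
            total += bucket.size();
        }
        return total;
    }

    /**
     * Hypothetical simulation: record `tuples` failed tuples that nobody removes
     * explicitly. When ticksDelivered is true, each tuple is followed by one
     * rotation (standing in for a tick), plus a final rotation at the end.
     */
    static int pendingAfter(int tuples, boolean ticksDelivered) {
        RotatingMapSketch<Long, String> pending = new RotatingMapSketch<>(2);
        for (long id = 0; id < tuples; id++) {
            pending.put(id, "failed");
            if (ticksDelivered) {
                pending.rotate();
            }
        }
        if (ticksDelivered) {
            pending.rotate();
        }
        return pending.size();
    }

    public static void main(String[] args) {
        System.out.println("entries left with ticks:    " + pendingAfter(1000, true));
        System.out.println("entries left without ticks: " + pendingAfter(1000, false));
    }
}
```

      With rotation the map drains; without it the entry count grows with the number of failed tuples, which matches the memory growth described here.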

      This leak eventually made the ackers crash, and when they came back up the other components kept blasting them with messages that would never be fully acked, which also leaked because of the tick problem.

      Looking back, this has been in every release since 0.9.1-incubating. It appears to have been introduced by https://github.com/apache/storm/commit/483ce454a3b2cd31b5d1c34e9365346459b358a8

      So every Apache release has this problem (which is the only reason I have not marked this as a blocker, because apparently it is not so bad that anyone has noticed in the past 4 years).

          People

            revans2 Robert Joseph Evans

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1h 40m