  Apache Storm / STORM-2733

Make Load Aware Shuffle much better at really bad situations


Details

    Description

      We recently had an issue where some bolts became badly backed up and started dying from OOMs. The cause turned out to be twofold.

      First, GC slowed the worker down so much that it could not keep up even with the < 1% of traffic still being sent to it, which made recovery almost impossible.

      The second issue was that the serialization of the tuples took a lot longer than the processing, which resulted in the send queue filling up much more quickly than the receive queue.

      I plan to address this in two ways. First, we need a better algorithm that can actually shut off the flow entirely to a very slow bolt. Second, we need to take the send queue into account when shuffling.

      This is not the full set of changes needed by STORM-2686, but it is a step in that direction. I am going to try to set it up so that the two algorithms work well together.
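
      As a rough illustration of the two changes described above, here is a minimal sketch of a weighted chooser. It is not Storm's actual load aware shuffle code. The class name, the 0.9 shut-off threshold, the representation of load as a queue fill fraction in [0, 1], the use of max() to combine the send and receive queue loads, and the uniform fallback when every target is overloaded are all assumptions made for the example.

```java
import java.util.Random;

/** Hypothetical sketch only; not Storm's load aware shuffle implementation. */
final class LoadAwareChooserSketch {
    // Assumed cutoff: a target at or above this load gets no traffic at all.
    static final double SHUT_OFF_LOAD = 0.9;

    /** Effective load is the worse of the two queue fill fractions (each in [0, 1]). */
    static double effectiveLoad(double receiveQueueLoad, double sendQueueLoad) {
        return Math.max(receiveQueueLoad, sendQueueLoad);
    }

    /** Weight is the spare capacity, or exactly 0 once the target counts as overloaded. */
    static double weight(double load) {
        return load >= SHUT_OFF_LOAD ? 0.0 : 1.0 - load;
    }

    /** Picks a target index with probability proportional to its weight. */
    static int choose(double[] receiveLoads, double[] sendLoads, Random rng) {
        int n = receiveLoads.length;
        double[] weights = new double[n];
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            weights[i] = weight(effectiveLoad(receiveLoads[i], sendLoads[i]));
            total += weights[i];
        }
        if (total == 0.0) {
            // Every target is overloaded: fall back to a uniform choice instead of stalling.
            return rng.nextInt(n);
        }
        double r = rng.nextDouble() * total;
        int lastPositive = -1;
        for (int i = 0; i < n; i++) {
            if (weights[i] <= 0.0) {
                continue; // shut-off targets are never selected
            }
            lastPositive = i;
            r -= weights[i];
            if (r < 0.0) {
                return i;
            }
        }
        return lastPositive; // guards against floating-point rounding
    }
}
```

      Because the weight drops to exactly zero rather than to a small positive value, a target whose send queue is nearly full receives no tuples even if its receive queue looks healthy, which is the case that a receive-queue-only view would miss.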


            People

              revans2 Robert Joseph Evans


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m