Uploaded image for project: 'Apache Storm'
  1. Apache Storm
  2. STORM-2693

Topology submission or kill takes too much time when topologies grow to a few hundred

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.6, 1.0.2, 1.1.0, 1.0.3
    • Fix Version/s: 2.0.0
    • Component/s: storm-core

      Description

      Now for a storm cluster with 40 hosts [with 32 cores/128G memory] and hundreds of topologies, nimbus submission and killing will take about minutes to finish. For example, for a cluster with 300 hundred of topologies´╝îit will take about 8 minutes to submit a topology, this affect our efficiency seriously.

      So, i check out the nimbus code and find two factor that will effect nimbus submission/killing time for a scheduling round:

      • read existing-assignments from zookeeper for every topology [will take about 4 seconds for a 300 topologies cluster]
      • read all the workers heartbeats and update the state to nimbus cache [will take about 30 seconds for a 300 topologies cluster]
        the key here is that Storm now use zookeeper to collect heartbeats [not RPC], and also keep physical plan [assignments] using zookeeper which can be totally local in nimbus.

      So, i think we should make some changes to storm's heartbeats and assignments management.

      For assignment promotion:
      1. nimbus will put the assignments in local disk
      2. when restart or HA leader trigger nimbus will recover assignments from zk to local disk
      3. nimbus will tell supervisor its assignment every time through RPC every scheduling round
      4. supervisor will sync assignments at fixed time

      For heartbeats promotion:
      1. workers will report executors ok or wrong to supervisor at fixed time
      2. supervisor will report workers heartbeats to nimbus at fixed time
      3. if supervisor die, it will tell nimbus through runtime hook
      or let nimbus find it through aware supervisor if is survive
      4. let supervisor decide if worker is running ok or invalid , supervisor will tell nimbus which executors of every topology are ok

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                danny0405 Danny Chan
                Reporter:
                danny0405 Danny Chan
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 43h 20m
                  43h 20m