  1. Apache Storm
  2. STORM-2044

Nimbus should not make assignments crazily when Pacemaker goes down and up

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.0.2
    • Fix Version/s: None
    • Component/s: storm-core
    • Environment: CentOS 6.5
    • Flags: Important

    Description

      Pacemaker is currently a stand-alone service with no HA support. When it goes down, all of the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery takes a long time when there are dozens of GB of heartbeats. Until the worker heartbeats are fully restored, Nimbus treats these workers as dead because their heartbeats have timed out, and it keeps reassigning these "dead" workers until heartbeats return to normal. As a result, many topologies are reassigned continuously during the recovery period and throughput drops sharply.
      This is not acceptable.
      I think Pacemaker is not suitable for production as long as this problem exists.
      I see several ways to solve this problem:
      1. Pacemaker HA
      2. When Pacemaker goes down, notify Nimbus not to make any further reassignments until it recovers
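
To make option 2 concrete, here is a minimal sketch of how such a guard might behave. It is not Storm code: the class `ReassignmentGuard`, its methods, and the idea of a fixed recovery grace period are all assumptions made for illustration. How Nimbus would actually learn that Pacemaker is down or back up is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical guard (not part of Storm's API) that suppresses
 * reassignment while the heartbeat store is down or still recovering.
 */
final class ReassignmentGuard {
    private final long heartbeatTimeoutMs; // worker is "dead" after this much silence
    private final long recoveryGraceMs;    // assumed time for heartbeats to be restored

    private boolean storeAvailable = true;
    private boolean recoveredAtLeastOnce = false;
    private long lastRecoveredAtMs = 0L;

    ReassignmentGuard(long heartbeatTimeoutMs, long recoveryGraceMs) {
        this.heartbeatTimeoutMs = heartbeatTimeoutMs;
        this.recoveryGraceMs = recoveryGraceMs;
    }

    /** Called when the heartbeat store (e.g. Pacemaker) becomes unreachable. */
    void onStoreDown() {
        storeAvailable = false;
    }

    /** Called when the heartbeat store is reachable again. */
    void onStoreUp(long nowMs) {
        storeAvailable = true;
        recoveredAtLeastOnce = true;
        lastRecoveredAtMs = nowMs;
    }

    /** Reassignment is allowed only when the store is up and the grace period has passed. */
    boolean reassignmentAllowed(long nowMs) {
        if (!storeAvailable) {
            return false;
        }
        return !recoveredAtLeastOnce || nowMs - lastRecoveredAtMs >= recoveryGraceMs;
    }

    /** Returns the workers whose heartbeats have timed out, or nothing while guarded. */
    List<String> workersToReassign(Map<String, Long> lastHeartbeatMs, long nowMs) {
        List<String> dead = new ArrayList<>();
        if (!reassignmentAllowed(nowMs)) {
            return dead; // heartbeats are missing or incomplete, so do not trust timeouts
        }
        for (Map.Entry<String, Long> entry : lastHeartbeatMs.entrySet()) {
            if (nowMs - entry.getValue() > heartbeatTimeoutMs) {
                dead.add(entry.getKey());
            }
        }
        return dead;
    }
}
```

With a guard like this, stale heartbeats seen during an outage or shortly after a restart would not trigger reassignment. The trade-off is that workers that really died in that window are detected later, so the grace period would need to be tuned against the expected recovery time.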

          People

            Assignee: Unassigned
            Reporter: Danny Chen (danny0405)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated: 672h
                Remaining: 672h
                Logged: Not Specified