Apache Storm / STORM-3984

Nimbus failover causes unnecessary reassignment once 600s have passed since Nimbus start-up

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.5.0
    • Fix Version/s: 2.6.0
    • Component/s: storm-server
    • Labels: None

    Description

      Because TimeOutWorkerHeartbeatsRecoveryStrategy.startTimeSecs is set at Nimbus start-up and never updated, TimeOutWorkerHeartbeatsRecoveryStrategy#exceedsMaxTimeOut always returns true once 600s (the value of supervisor.worker.heartbeats.max.timeout.secs) have passed since Nimbus started.

      The resulting spurious timeout in the newly elected leader Nimbus causes an unnecessary reassignment right after failover:

      2023-09-25 15:16:46.538 o.a.s.n.NimbusInfo main-EventThread [INFO] Nimbus figures out its name to h02
      2023-09-25 15:16:46.549 o.a.s.n.LeaderListenerCallback main-EventThread [INFO] Sync remote assignments and id-info to local
      2023-09-25 15:16:46.571 o.a.s.n.LeaderListenerCallback main-EventThread [INFO] active-topology-blobs [word-count-1-1695654263] local-topology-blobs [word-count-1-1695654263-stormconf.ser,word-count-1-1695654263-stormjar.jar,word-count-1-1695654263-stormcode.ser] diff-topology-blobs []
      2023-09-25 15:16:46.596 o.a.s.n.LeaderListenerCallback main-EventThread [INFO] active-topology-dependencies [] local-blobs [word-count-1-1695654263-stormconf.ser,word-count-1-1695654263-stormjar.jar,word-count-1-1695654263-stormcode.ser] diff-topology-dependencies []
      2023-09-25 15:16:46.596 o.a.s.n.LeaderListenerCallback main-EventThread [INFO] Accepting leadership, all active topologies and corresponding dependencies found locally.
      2023-09-25 15:16:46.596 o.a.s.z.LeaderListenerCallbackFactory main-EventThread [INFO] h02 gained leadership.
      2023-09-25 15:16:46.744 o.a.s.n.TimeOutWorkerHeartbeatsRecoveryStrategy timer [WARN] Failed to recover heartbeats for nodes: [c26e72ef-b84b-4d44-820a-fec9407e38cf-172.18.0.11, 57ff205e-6d90-4305-abb8-b9ff0ff7bcc3-172.18.0.13, f10f6554-0e55-4c01-a6ce-834df068d753-172.18.0.12] with timeout 600s
      2023-09-25 15:16:46.807 o.a.s.d.n.HeartbeatCache timer [INFO] Executor word-count-1-1695654263:[8, 8] not alive
      2023-09-25 15:16:46.808 o.a.s.d.n.HeartbeatCache timer [INFO] Executor word-count-1-1695654263:[16, 16] not alive
      ...(snip)
      2023-09-25 15:16:46.862 o.a.s.d.n.Nimbus timer [INFO] Reassigning word-count-1-1695654263 to 3 slots
      2023-09-25 15:16:46.862 o.a.s.d.n.Nimbus timer [INFO] Reassign executors: [[20, 20], [14, 14], [12, 12], [16, 16], [18, 18], [28, 28], [26, 26], [10, 10], [8, 8], [24, 24], [6, 6], [22, 22], [2, 2], [4, 4], [13, 13], [11, 11], [7, 7], [9, 9], [19, 19], [23, 23], [21, 21], [25, 25], [27, 27], [5, 5], [1, 1], [3, 3], [15, 15], [17, 17]]
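
      The timing problem described above can be illustrated with a small, self-contained Java sketch. This is a simplified model, not Storm's actual source: the class name HeartbeatRecoveryTimeoutSketch, the injected clock, the strict ">" comparison, and the resetStartTime() method are all assumptions made for illustration. Only the field name startTimeSecs, the method name exceedsMaxTimeOut, and the 600s value come from the report. The reset shown at the end is one conceivable remedy and is not necessarily the change that was merged for this issue.

```java
import java.util.function.IntSupplier;

/** Simplified, hypothetical model of the timeout check; NOT Storm's real class. */
class HeartbeatRecoveryTimeoutSketch {
    // Stands in for supervisor.worker.heartbeats.max.timeout.secs (600 in the report).
    private final int maxTimeoutSecs;
    // Injected clock in seconds (an assumption, used to keep the example deterministic).
    private final IntSupplier clock;
    // Recorded once at construction, i.e. at "Nimbus start-up".
    private int startTimeSecs;

    HeartbeatRecoveryTimeoutSketch(int maxTimeoutSecs, IntSupplier clock) {
        this.maxTimeoutSecs = maxTimeoutSecs;
        this.clock = clock;
        this.startTimeSecs = clock.getAsInt();
    }

    /** True once more than maxTimeoutSecs have elapsed since startTimeSecs. */
    boolean exceedsMaxTimeOut() {
        return clock.getAsInt() - startTimeSecs > maxTimeoutSecs;
    }

    /** Hypothetical remedy: restart the recovery window, e.g. when leadership is gained. */
    void resetStartTime() {
        startTimeSecs = clock.getAsInt();
    }

    public static void main(String[] args) {
        int[] now = {0}; // fake clock value, in seconds
        HeartbeatRecoveryTimeoutSketch standby =
                new HeartbeatRecoveryTimeoutSketch(600, () -> now[0]);

        now[0] = 700; // failover happens 700s after this Nimbus started
        // The recovery window is already "expired" before any heartbeat could arrive.
        System.out.println(standby.exceedsMaxTimeOut()); // prints "true"

        standby.resetStartTime(); // restart the window at the moment of failover
        System.out.println(standby.exceedsMaxTimeOut()); // prints "false"
    }
}
```

      In this model, a standby that has been running for longer than the timeout reports the recovery window as expired the instant it becomes leader, which matches the "Failed to recover heartbeats ... with timeout 600s" warning appearing within about 150 ms of "gained leadership" in the log above.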
      

            People

              Assignee: Masatake Iwasaki (iwasakims)
              Reporter: Masatake Iwasaki (iwasakims)

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h