Affects Version/s: 1.2.1
Fix Version/s: None
While deploying a topology on nimbus, the topology hit errors caused by a wrong configuration, so every request to the topology was failing. Our code contains logic that, when a topology observes more errors than a given threshold, issues a storm deactivate command for that topology to nimbus.
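The threshold-based deactivation described above might look roughly like the following shell sketch. This is an illustration, not our actual application code: the threshold value, topology name, and the way the error count is obtained are all assumptions, and the Storm CLI command is echoed rather than executed.

```shell
#!/bin/sh
# Hypothetical sketch of threshold-based topology deactivation.
# ERROR_THRESHOLD and the source of the error count are assumptions;
# the real application tracks errors internally.
ERROR_THRESHOLD=100

maybe_deactivate() {
    topology="$1"
    error_count="$2"
    if [ "$error_count" -gt "$ERROR_THRESHOLD" ]; then
        # In the real system this runs the Storm CLI command
        # "storm deactivate <topology-name>"; echoed here for safety.
        echo "storm deactivate $topology"
    else
        echo "ok"
    fi
}

maybe_deactivate my-topology 150
```

The actual `storm deactivate` command is a standard Storm CLI operation that asks nimbus to deactivate the named topology.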
The restart of topologies was being done via script.
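A restart script of the kind mentioned typically kills the running topology, waits for it to drain, and then resubmits it. A minimal sketch, assuming placeholder names for the topology, jar path, and main class; STORM is set to echo so this is a dry run that prints the commands instead of executing them:

```shell
#!/bin/sh
# Hypothetical restart sequence for a single topology.
# TOPOLOGY, the jar path, and the main class are placeholders.
# STORM="echo storm" makes this a dry run.
STORM="echo storm"
TOPOLOGY="my-topology"

# Kill with a 30-second wait so in-flight tuples can drain,
# then resubmit the topology to nimbus.
$STORM kill "$TOPOLOGY" -w 30
$STORM jar /opt/app/topology.jar com.example.MyTopology "$TOPOLOGY"
```

`storm kill -w <seconds>` and `storm jar <jar> <main-class> <args>` are the standard Storm CLI commands for killing and submitting topologies.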
The scenario was:
Two topologies were restarted successfully but were failing due to the wrong configuration. As a result, the deactivate command was submitted to nimbus for each of them.
A third topology was killed successfully, and nimbus received the submission command to restart it.
At this point, the Storm UI stopped responding completely. Running a kill command via the CLI against nimbus did not work either; it stayed stuck.
Meanwhile, the two topologies with errors were still running, since deactivation through nimbus had not succeeded, and the third topology had not been restarted.
Any other command run against nimbus also stayed stuck until nimbus was stopped on the leader machine and another nimbus became leader.
There are no helpful entries in the nimbus, supervisor, or worker logs of the affected topologies, apart from the zookeeper INFO log below:
zookeeper [INFO] exception org.apache.storm.shade.org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /errors/<AFFECTED-TOPOLOGY-ID>/<BOLTNAME-WITH_ERRORS>/e0000001494
Storm version: 1.2.1