Details
- Bug
- Status: Closed
- Minor
- Resolution: Abandoned
- 2.1.0, 2.2.0
- None
- None
Description
When all Nimbus instances in a cluster lose access to previously stored blobs while at least one topology is deployed, the cluster cannot recover: no Nimbus is ever elected leader because the blobs are missing. Recovery is only possible by manually removing the blob and topology data from ZooKeeper.
I understand that the LocalFs blob store implementation is not particularly suited to high-availability deployments. However, this issue prevents sensible automated disaster recovery on small deployments, where a full HDFS deployment would provide no benefit and would only add complexity.
Reproduction Steps
- Deploy one or more Nimbus instances
- Deploy a topology (such as the WordCount example)
- Stop all Nimbus instances
- Remove all blob directories
- Start all Nimbus instances
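The blob removal step can be scripted. The following is a minimal sketch, assuming each Nimbus keeps its LocalFs blob store in a `blobs` directory directly under `storm.local.dir` (adjust the path if `blobstore.dir` is set). The function name `remove_blob_dirs` is hypothetical. Stopping and starting Nimbus depends on how the cluster is deployed and is left out.

```python
import shutil
from pathlib import Path


def remove_blob_dirs(storm_local_dirs, blob_dir_name="blobs"):
    """Delete the blob directory under each given Storm local dir.

    Assumption: blobs live in <storm.local.dir>/blobs.
    Returns the list of directories that were actually removed.
    """
    removed = []
    for local_dir in storm_local_dirs:
        blob_dir = Path(local_dir) / blob_dir_name
        if blob_dir.is_dir():
            # Removes the directory and everything below it.
            shutil.rmtree(blob_dir)
            removed.append(blob_dir)
    return removed
```

Run it on every Nimbus host while Nimbus is stopped, passing that host's `storm.local.dir`, for example `remove_blob_dirs(["/var/lib/storm"])` (a placeholder path).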
Expected Behavior
When a topology's blobs are permanently lost, the topology itself should be marked as failed so that the cluster stays available. As it stands, a single lost topology is enough to take down the entire system.
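The difference between the observed and the requested behavior can be illustrated with a toy model. This is only a sketch and not Storm's election code: it assumes a Nimbus is eligible for leadership only if it holds every blob required by the active topologies, and all names (`electable_nimbuses`, `drop_lost`, the blob keys) are made up for illustration.

```python
def electable_nimbuses(active_topologies, blobs_by_nimbus, drop_lost=False):
    """Toy model of leader eligibility; not Storm's actual implementation.

    active_topologies: dict topology id -> set of blob keys it requires
    blobs_by_nimbus:   dict nimbus id -> set of blob keys stored locally
    drop_lost:         if True, apply the behavior requested in this issue
    Returns (eligible nimbus ids, topology ids marked as failed).
    """
    # Blob keys that at least one Nimbus still holds.
    available = set().union(*blobs_by_nimbus.values())

    failed = set()
    if drop_lost:
        # A topology needing a blob that no Nimbus holds cannot be restored.
        failed = {
            topo for topo, keys in active_topologies.items()
            if not keys <= available
        }

    # Blobs a leader must hold: those of all topologies not marked failed.
    required = set()
    for topo, keys in active_topologies.items():
        if topo not in failed:
            required |= keys

    eligible = {n for n, keys in blobs_by_nimbus.items() if required <= keys}
    return eligible, failed
```

With one topology and two Nimbus instances that have both lost their blobs, the model returns no eligible leader by default, and both instances as eligible (with the topology marked failed) when `drop_lost=True`.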