[STORM-3664] Nimbus cannot recover from LocalFsBlobStore deletion - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Abandoned
Affects Version/s: 2.1.0, 2.2.0
Fix Version/s: None
Component/s: blobstore, storm-server
Labels: None

Description

When all Nimbus instances in a cluster lose access to previously stored blobs while at least one topology is deployed, the cluster cannot recover: no Nimbus instance is ever elected leader because the blobs are missing. Recovery is only possible by manually removing the blob and topology data from ZooKeeper.
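The failure mode can be illustrated with a small model. This is a simplified sketch, not Storm's actual code: it assumes a Nimbus instance only accepts leadership when every blob belonging to an active topology is present in its local blob store, and that each topology owns three blobs named with the suffixes shown. The function names and the topology id are hypothetical.

```python
from typing import Iterable, Set

# Assumed per-topology blob key suffixes (jar, serialized topology, serialized conf).
BLOB_SUFFIXES = ("-stormjar.jar", "-stormcode.ser", "-stormconf.ser")


def required_blob_keys(topology_id: str) -> Set[str]:
    """Blob keys a Nimbus instance is assumed to need for one topology."""
    return {topology_id + suffix for suffix in BLOB_SUFFIXES}


def missing_blob_keys(active_topology_ids: Iterable[str],
                      local_blob_keys: Iterable[str]) -> Set[str]:
    """Keys required by the active topologies that are absent locally."""
    required: Set[str] = set()
    for topology_id in active_topology_ids:
        required |= required_blob_keys(topology_id)
    return required - set(local_blob_keys)


def can_accept_leadership(active_topology_ids: Iterable[str],
                          local_blob_keys: Iterable[str]) -> bool:
    """Modelled rule: accept leadership only if no required blob is missing."""
    return not missing_blob_keys(active_topology_ids, local_blob_keys)


if __name__ == "__main__":
    active = ["wordcount-1-1593676860"]  # hypothetical topology id
    intact_store = required_blob_keys(active[0])
    wiped_store: Set[str] = set()
    print(can_accept_leadership(active, intact_store))  # True
    print(can_accept_leadership(active, wiped_store))   # False
```

Under this model, once every instance's blob directory is empty, `can_accept_leadership` is false on all of them for as long as the topology remains registered, which matches the deadlock described above.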

I understand that the LocalFs blob store implementation is not particularly suited to high-availability deployments. However, this issue prevents sensible automated disaster recovery on small deployments, where a full HDFS deployment would provide no benefit and would simply introduce additional complexity.

Reproduction Steps

1. Deploy one or more Nimbus instances
2. Deploy a topology (such as the WordCount example)
3. Stop all Nimbus instances
4. Remove all blob directories
5. Start all Nimbus instances

Expected Behavior

When a topology's blobs are permanently lost, the topology itself should be marked as failed so that the cluster stays available. As it stands, a single lost topology is enough to take down the entire system.
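A minimal sketch of the requested behavior, under the same simplifying assumptions as the report: each topology owns three blobs with the suffixes below, and a blob missing from the local store is treated as permanently lost. The function name `triage_topologies` and the topology ids are hypothetical and not part of Storm.

```python
from typing import Dict, Iterable, List

# Assumed per-topology blob key suffixes (jar, serialized topology, serialized conf).
BLOB_SUFFIXES = ("-stormjar.jar", "-stormcode.ser", "-stormconf.ser")


def triage_topologies(active_topology_ids: Iterable[str],
                      local_blob_keys: Iterable[str]) -> Dict[str, List[str]]:
    """Split topologies into those with all blobs present and those to mark failed.

    Leadership would then be accepted regardless of the 'failed' list,
    so one broken topology cannot block the whole cluster.
    """
    local = set(local_blob_keys)
    result: Dict[str, List[str]] = {"recoverable": [], "failed": []}
    for topology_id in sorted(active_topology_ids):
        required = {topology_id + suffix for suffix in BLOB_SUFFIXES}
        bucket = "recoverable" if required <= local else "failed"
        result[bucket].append(topology_id)
    return result


if __name__ == "__main__":
    # Hypothetical state: blobs for "topo-a" survived, blobs for "topo-b" were deleted.
    surviving = {"topo-a" + suffix for suffix in BLOB_SUFFIXES}
    print(triage_topologies(["topo-b", "topo-a"], surviving))
    # {'recoverable': ['topo-a'], 'failed': ['topo-b']}
```

In a cluster with several Nimbus instances, a real implementation would first have to establish that no other instance still holds the blob before declaring a topology failed; the sketch skips that step.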

People

Assignee: Unassigned

Reporter: Johannes Donath

Votes: 0

Watchers: 2

Dates

Created: 02/Jul/20 08:01

Updated: 02/Jul/24 09:53

Resolved: 02/Jul/24 09:53