[NIFI-8477] If a node completely dies, can not delete it from the cluster; AKA Zombie Node - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.13.2
Fix Version/s: 1.14.0
Component/s: None
Labels:
- cluster
- disconnection
Environment:
Dockerized AWS ECS Instance

Description

Our nodes are ephemeral. Once they fall over, they don't come back in any stateful manner. This is known to create data loss, which we are aware of.

The issue we are seeing is, that when they fall over (right now we are forcefully knocking them over to test resiliency) the cluster heartbeat will flag them as disconnected, but there is no way to then delete them as we get a
ERROR: Error executing command 'delete-node' : Error deleting node: java.net.SocketTimeoutException: timeout

We have increased the read/connect timeouts to 20s (from default 5s) and that changes the error to a `read timeout`

Increasing those values to anything greater than 30s gives us unstable usage across the board

{ "servlet":"jerseySpring", "message":"Service Unavailable", "url":"/nifi-api/flow/current-user", "status":"503" }

ERROR: Error executing command 'get-nodes' : Read timed out
Occasionally, when some stars align, we are able to delete the node via the toolkit cli, but it happens far and few between but does lean itself to some timing issue.

Have discussed this thoroughly on the Nifi Slack, and Joe W. has mentioned
"I know something super bad happened and well - crap happens - can you help us clean the cluster back up and get on with life?"
and ...
we need a nicer option. I'm not sure if the CLI does something smart here

Attachments

Issue Links

links to

GitHub Pull Request #5039

Activity

People

Assignee:: Mark Payne

Reporter:: Chris McKeever

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 26/Apr/21 19:05

Updated:: 10/May/21 20:16

Resolved:: 10/May/21 20:16

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1h 10m