Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Affects Version/s: 1.13.2
- Fix Version/s: None
- Environment: Dockerized AWS ECS Instance
Description
Our nodes are ephemeral. Once they fall over, they do not come back in any stateful manner. We are aware this can cause data loss.
The issue we are seeing is that when they fall over (right now we are forcefully knocking them over to test resiliency), the cluster heartbeat flags them as disconnected, but there is then no way to delete them, as we get:
ERROR: Error executing command 'delete-node' : Error deleting node: java.net.SocketTimeoutException: timeout
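As a point of comparison, the same removal can be attempted against the REST API the CLI ultimately calls. The sketch below is a minimal, hedged example assuming an unsecured HTTP endpoint; `NIFI_URL` and `NODE_ID` are placeholders, and authentication headers would need to be added on a secured instance.

```python
# Minimal sketch: issue the node delete directly against the NiFi REST API
# with a generous (connect, read) timeout, to rule out the CLI itself.
# NIFI_URL and NODE_ID are placeholders (assumptions), not values from this issue.
import requests

NIFI_URL = "http://localhost:8080"                 # assumed unsecured endpoint
NODE_ID = "<node-uuid-from-cluster-listing>"       # placeholder node id

resp = requests.delete(
    f"{NIFI_URL}/nifi-api/controller/cluster/nodes/{NODE_ID}",
    timeout=(10, 60),                              # (connect, read) in seconds
)
print(resp.status_code, resp.text)
```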
We have increased the read/connect timeouts to 20s (from the default of 5s), which changes the error to a `read timeout`.
Increasing those values to anything greater than 30s gives us unstable usage across the board:
{ "servlet":"jerseySpring", "message":"Service Unavailable", "url":"/nifi-api/flow/current-user", "status":"503" }ERROR: Error executing command 'get-nodes' : Read timed out
Occasionally, when the stars align, we are able to delete the node via the toolkit CLI, but success is few and far between, which does point to some timing issue.
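Since the delete does occasionally succeed, our interim workaround is to retry it rather than rely on a single call. This is a hedged sketch only, assuming the standard cluster endpoints and JSON shape (`cluster` -> `nodes` -> `nodeId`/`status`); `NIFI_URL`, the retry counts, and the delays are assumptions, and auth is omitted.

```python
# Sketch: list DISCONNECTED nodes from the cluster endpoint, then retry the
# delete with a fixed delay, since single attempts often time out.
import time
import requests

NIFI_URL = "http://localhost:8080"   # placeholder; add auth if secured

def disconnected_node_ids():
    # GET /nifi-api/controller/cluster returns the cluster node listing.
    cluster = requests.get(f"{NIFI_URL}/nifi-api/controller/cluster",
                           timeout=(10, 30)).json()
    return [n["nodeId"] for n in cluster["cluster"]["nodes"]
            if n["status"] == "DISCONNECTED"]   # assumed status value

def delete_with_retry(node_id, attempts=5, delay=15):
    for i in range(attempts):
        try:
            r = requests.delete(
                f"{NIFI_URL}/nifi-api/controller/cluster/nodes/{node_id}",
                timeout=(10, 60))
            if r.ok:
                return True
            print(f"attempt {i + 1}: HTTP {r.status_code}")
        except requests.exceptions.RequestException as exc:
            print(f"attempt {i + 1}: {exc}")
        time.sleep(delay)
    return False

for node_id in disconnected_node_ids():
    delete_with_retry(node_id)
```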
We have discussed this thoroughly on the NiFi Slack, and Joe W. has mentioned:
"I know something super bad happened and well - crap happens - can you help us clean the cluster back up and get on with life?"
and ...
we need a nicer option. I'm not sure if the CLI does something smart here
Attachments
Issue Links
- links to