Apache NiFi / NIFI-8477

If a node completely dies, it cannot be deleted from the cluster; AKA Zombie Node

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.13.2
    • Fix Version/s: 1.14.0
    • None
    • Environment: Dockerized AWS ECS Instance

    Description

      Our nodes are ephemeral. Once they fall over, they don't come back in any stateful manner. We are aware that this can cause data loss.

      The issue we are seeing is that when they fall over (right now we are forcefully knocking them over to test resiliency), the cluster heartbeat flags them as disconnected, but there is then no way to delete them, because we get:
      ERROR: Error executing command 'delete-node' : Error deleting node: java.net.SocketTimeoutException: timeout
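
For anyone scripting the cleanup instead of using the toolkit CLI, here is a minimal sketch of picking out the disconnected nodes and building the delete requests. It assumes the cluster summary comes from `GET /nifi-api/controller/cluster` and that a node is removed with `DELETE /nifi-api/controller/cluster/nodes/{id}`; the helper name, the sample payload, and the host are made up for illustration, and the field names (`nodeId`, `status`) should be checked against the REST API documentation for the NiFi version in use.

```python
import json
from urllib.parse import quote


def disconnected_node_delete_urls(base_url, cluster_json):
    """Return one DELETE URL per node that the cluster reports as DISCONNECTED.

    cluster_json is the response body of the cluster summary request,
    assumed to look like {"cluster": {"nodes": [{"nodeId": ..., "status": ...}]}}.
    """
    cluster = json.loads(cluster_json).get("cluster", {})
    root = base_url.rstrip("/")
    urls = []
    for node in cluster.get("nodes", []):
        if node.get("status") == "DISCONNECTED":
            node_id = quote(node["nodeId"], safe="")
            urls.append(f"{root}/nifi-api/controller/cluster/nodes/{node_id}")
    return urls


# Hypothetical payload: one healthy node and one that has fallen over.
SAMPLE_CLUSTER = json.dumps({
    "cluster": {
        "nodes": [
            {"nodeId": "node-a", "address": "10.0.0.1", "status": "CONNECTED"},
            {"nodeId": "node-b", "address": "10.0.0.2", "status": "DISCONNECTED"},
        ]
    }
})

for url in disconnected_node_delete_urls("https://nifi.example:8443/", SAMPLE_CLUSTER):
    print("would DELETE", url)
```

This only builds the URLs; issuing the request still runs into the timeout described above when the dead node is unreachable.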

      We have increased the read/connect timeouts to 20s (from the default 5s), and that changes the error to a `read timeout`.

      Increasing those values to anything greater than 30s makes the cluster unstable across the board:

      { "servlet":"jerseySpring", "message":"Service Unavailable", "url":"/nifi-api/flow/current-user", "status":"503" }

      ERROR: Error executing command 'get-nodes' : Read timed out
      Occasionally, when the stars align, we are able to delete the node via the toolkit CLI, but this is few and far between, which suggests some kind of timing issue.
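
Since the delete does succeed now and then, a stopgap is to retry it with a growing delay. The sketch below is only a workaround under that assumption, not a fix for the underlying problem; `retry_on_timeout` and its parameters are hypothetical names, and `action` stands in for whatever actually issues the delete (a CLI invocation or an HTTP request).

```python
import socket
import time


def retry_on_timeout(action, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call action() until it returns, retrying only when it times out.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last timeout is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except (TimeoutError, socket.timeout):
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting; in normal use the default `time.sleep` applies.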

       

      We have discussed this thoroughly on the NiFi Slack, and Joe W. has mentioned:
      "I know something super bad happened and well - crap happens - can you help us clean the cluster back up and get on with life?"
      and ... 
      we need a nicer option.  I'm not sure if the CLI does something smart here
       

            People

              Assignee: Mark Payne (markap14)
              Reporter: Chris McKeever (cgmckeever)
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 10m