Here's why this happens.
Currently, open and close calls on sinks and sources happen in the heartbeat thread. Thus, if open or close blocks or takes a long time, the heartbeat thread becomes blocked. So if a sink were set to be an rpcSink to a machine or port that wasn't up, and it were to retry on failures, the node would be blocked. To make this worse, if there are multiple logical nodes on a physical node and one logical node blocks like this, all the nodes get blocked.
A previous patch addressed part of the problem by making open lazy, which effectively pushed the open call into the logical node's driver thread. This was great for the situations above: the open retries would happen in the logical node's driver thread.
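
As an illustration of the lazy-open idea, here is a minimal sketch. It is not the code from that patch: the Sink interface and the LazyOpenSink class are hypothetical names invented for this example, and the real sink API may differ. The point is only that open() returns immediately, and the potentially slow open of the wrapped sink runs on whichever thread first calls append().

```java
import java.io.IOException;

// Hypothetical minimal sink interface, assumed for this sketch only.
interface Sink {
    void open() throws IOException;
    void append(String event) throws IOException;
    void close() throws IOException;
}

// Wraps a sink so that open() only records intent. The wrapped sink is
// opened on the first append(), on the thread that calls append().
class LazyOpenSink implements Sink {
    private final Sink delegate;
    private boolean logicallyOpen = false; // open() has been called
    private boolean actuallyOpen = false;  // delegate.open() has succeeded

    LazyOpenSink(Sink delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized void open() {
        // Returns immediately, so the caller is never blocked here.
        logicallyOpen = true;
    }

    @Override
    public synchronized void append(String event) throws IOException {
        if (!logicallyOpen) {
            throw new IOException("sink is not open");
        }
        if (!actuallyOpen) {
            delegate.open(); // may block or retry, but on the appending thread
            actuallyOpen = true;
        }
        delegate.append(event);
    }

    @Override
    public synchronized void close() throws IOException {
        logicallyOpen = false;
        if (actuallyOpen) {
            actuallyOpen = false;
            delegate.close(); // still runs on the caller's thread
        }
    }

    synchronized boolean isActuallyOpen() {
        return actuallyOpen;
    }
}
```

Note that in this sketch close() still calls straight through to the wrapped sink, so a slow close blocks whichever thread invokes it.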
Unfortunately, blocking still happens if close takes a long time to complete. There are two common cases where this happens. DFO and WAL currently have semantics where any durable entries are flushed before close completes. When coupled with a sink that "never" fails, this means the DFO/WAL will never appear closed. The close is therefore effectively blocked, which prevents new changes from going in.