Flume / FLUME-53

Heartbeat from node "hangs" when changing configuration.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: v0.9.0
    • Fix Version/s: v0.9.1
    • Component/s: Node, Sinks+Sources
    • Labels:
      None

      Description

      Certain sinks / decorators can block or take a long time to close. This currently happens in the heartbeat thread and can make a node appear to be hung if the sinks/decos are blocked.

          Activity

          Jonathan Hsieh added a comment -

          Closing released issues.

          Jonathan Hsieh added a comment -

          committed

          Jonathan Hsieh added a comment -

          This patch "fixes" the problem by pushing open and close actions into a separate thread. It is the simplest thing that could be done, and the user display is accurate. It exacerbates FLUME-37 further; a patch for that is forthcoming.
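
          As an illustration of the idea above (not the actual patch), here is a minimal Java sketch of handing open and close actions to a separate worker thread so that the caller, e.g. the heartbeat thread, returns immediately; the class and method names are hypothetical.

            // Hypothetical sketch: open/close run on a dedicated worker thread
            // instead of the thread that requested them.
            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;

            class AsyncLifecycle {
              private final ExecutorService worker = Executors.newSingleThreadExecutor();

              void openAsync(Runnable openAction) {
                worker.submit(openAction);   // may block inside the worker, not the caller
              }

              void closeAsync(Runnable closeAction) {
                worker.submit(closeAction);  // heartbeat thread keeps running
              }
            }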

          Jonathan Hsieh added a comment -

          Added a progress check to DFO clean close and it seems to work. Adding tests. This triggered a slew of other problems, which, at least for DFO, seem to be cleaned up.
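
          A minimal sketch of what a "clean close with a progress check" could look like, assuming a hypothetical progress interface rather than Flume's actual API: keep waiting for the decorator to drain only while progress is being made, and give up after a bounded period with no progress.

            // Hypothetical sketch of a close that waits only while progress is made.
            interface ProgressChecker {
              boolean madeProgress();   // true if entries were drained since the last call
              boolean isDrained();      // true once all pending durable entries are sent
            }

            class CleanClose {
              // Returns true for a clean (fully drained) close, false if no progress
              // was made within maxNoProgressMillis and the close should be treated
              // as unclean.
              static boolean waitForDrain(ProgressChecker deco, long maxNoProgressMillis)
                  throws InterruptedException {
                long lastProgress = System.currentTimeMillis();
                while (!deco.isDrained()) {
                  if (deco.madeProgress()) {
                    lastProgress = System.currentTimeMillis();
                  } else if (System.currentTimeMillis() - lastProgress > maxNoProgressMillis) {
                    return false;
                  }
                  Thread.sleep(100);    // polling interval
                }
                return true;
              }
            }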

          Jonathan Hsieh added a comment -

          Here's the high level approach for a solution.

          Two parts:

          1. Push responsibility for opening and closing sources and sinks into the driver thread. This means that the heartbeat thread will likely just queue off the data it retrieves for things that need to be done, and then hand that data off to the logical node's driver thread, which will execute it. Changing a config for that logical node may still hang or take a while, but the heartbeats will continue. (A sketch of this handoff follows after item 2.)

          2. Modify the clean close of DFODeco and WALDeco with a timeout for the case where the subsink refuses to complete and is making no progress. This would potentially be deemed an unclean exit (requiring recovery of durable logs) and would likely exit the driver thread by throwing an exception.
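
          A minimal sketch of the handoff described in part 1, using hypothetical names rather than Flume's actual classes: the heartbeat thread only enqueues the new configuration, and the logical node's driver thread dequeues and applies it, so a blocking open or close stalls only that driver, not the heartbeats.

            import java.util.concurrent.BlockingQueue;
            import java.util.concurrent.LinkedBlockingQueue;

            // Hypothetical sketch: config changes are queued by the heartbeat
            // thread and applied by the logical node's driver thread.
            class ConfigHandoff {
              private final BlockingQueue<String> pendingConfigs = new LinkedBlockingQueue<>();

              // Called from the heartbeat thread; never blocks on open/close.
              void onNewConfig(String configSpec) {
                pendingConfigs.offer(configSpec);
              }

              // Called from the logical node's driver loop.
              void applyPendingConfig() throws InterruptedException {
                String configSpec = pendingConfigs.take();
                // Close the old source/sink and open the ones described by
                // configSpec here. If this blocks, only this logical node's
                // driver is affected; heartbeats keep flowing.
              }
            }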

          Jonathan Hsieh added a comment - (edited)

          Here's why this happens.

          Currently, open and close calls on sinks and sources happen in the same thread as the heartbeat thread. Thus, if open or close blocks or takes a long time, the heartbeat thread becomes blocked. So if a sink were an rpcSink pointed at a machine or port that wasn't up, and it were set to retry on failures, the node would be blocked. To make this worse, if there are multiple logical nodes on a physical node and one of them blocks like this, all of the nodes get blocked.

          A previous patch addressed part of the problem by making open lazy, which effectively pushed the open call into the logical node's driver thread. This was great for the situations above: the open retries would happen in the logical node's driver thread.

          Unfortunately, blocking can still happen if close takes a long time to complete. There are two common cases where this happens: DFO and WAL currently have semantics where any durable entries are flushed before close completes. When coupled with a sink that "never" fails, this means the DFO/WAL will never appear closed, so the close is effectively blocked and prevents new changes from going in.
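
          To make the failure mode above concrete, here is a minimal sketch (not Flume's actual DFO/WAL code) of a close that refuses to finish until every durable entry has been pushed through the subsink; if the subsink's append retries indefinitely instead of failing, this loop never ends and the calling thread, here the heartbeat thread, hangs.

            import java.util.ArrayDeque;
            import java.util.Queue;

            // Hypothetical sketch of "flush all durable entries before close completes".
            class DurableDeco {
              interface Sink {
                void append(byte[] event) throws InterruptedException; // may retry internally, forever
                void close();
              }

              private final Queue<byte[]> pendingDurableEntries = new ArrayDeque<>();
              private final Sink subsink;

              DurableDeco(Sink subsink) {
                this.subsink = subsink;
              }

              void close() throws InterruptedException {
                // Insist on draining every durable entry before completing the close.
                while (!pendingDurableEntries.isEmpty()) {
                  subsink.append(pendingDurableEntries.peek()); // blocks here if the subsink keeps retrying
                  pendingDurableEntries.poll();
                }
                subsink.close();
              }
            }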


            People

            • Assignee:
              Jonathan Hsieh
            • Reporter:
              Jonathan Hsieh
            • Votes:
              0
            • Watchers:
              0
