Flume / FLUME-53

Heartbeat from node "hangs" when changing configuration.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: v0.9.0
    • Fix Version/s: v0.9.1
    • Component/s: Node, Sinks+Sources
    • Labels:
      None

      Description

      Certain sinks and decorators can block or take a long time to close. The close currently runs in the heartbeat thread, so a blocked sink or decorator can make the node appear to be hung.

          Activity

          Transition              Time In Source Status  Execution Times  Last Executer   Last Execution Date
          Open -> In Progress     1d 8h 46m              1                Jonathan Hsieh  16/Jul/10 04:16
          In Progress -> Open     13h 17m                1                Patrick Hunt    16/Jul/10 17:34
          Open -> Blocked         13d 1h 41m             1                Jonathan Hsieh  29/Jul/10 19:15
          Blocked -> Resolved     7d 1h 19m              1                Jonathan Hsieh  05/Aug/10 20:35
          Resolved -> Closed      129d 19h 3m            1                Jonathan Hsieh  13/Dec/10 14:38
          Mark Thomas made changes -
          Project Import Tue Aug 02 16:57:12 UTC 2011 [ 1312304232406 ]
          Jonathan Hsieh made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Jonathan Hsieh added a comment -

          Closing released issues.

          Jonathan Hsieh made changes -
          Resolution Fixed [ 1 ]
          Status Patch Available [ 10000 ] Resolved [ 5 ]
          Jonathan Hsieh added a comment -

          committed

          Jonathan Hsieh made changes -
          Attachment 0001-FLUME-53-Heartbeat-from-node-hangs-when-changing-con.patch [ 10076 ]
          Jonathan Hsieh added a comment -

          This patch "fixes" the problem by pushing open and close actions into a separate thread. It is the simplest thing that could be done, and the user display is accurate. It exacerbates FLUME-37; a patch for that is forthcoming.
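
The idea of moving close off the heartbeat thread can be sketched roughly as follows. This is a minimal illustration, not the code from the attached patch: `Closable`, `HeartbeatSketch`, `onConfigChange`, and `beat` are hypothetical names invented for the example, and a single-thread executor stands in for whatever thread the patch actually uses.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a sink or decorator whose close() may block. */
interface Closable {
    void close() throws InterruptedException;
}

/** Hypothetical heartbeat loop that never runs close() itself. */
class HeartbeatSketch {
    // One worker per logical node, so a stuck node only stalls itself.
    private final ExecutorService nodeWorker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "logical-node-worker");
        t.setDaemon(true); // a stuck close should not keep the JVM alive
        return t;
    });
    private final AtomicInteger beats = new AtomicInteger();

    /** Called from the heartbeat thread when a new configuration arrives. */
    void onConfigChange(Closable oldSink) {
        // Hand the potentially slow close to the worker and return immediately.
        nodeWorker.submit(() -> {
            try {
                oldSink.close();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    /** One heartbeat; returns the running count of beats sent so far. */
    int beat() {
        return beats.incrementAndGet();
    }
}
```

With this shape, a close that never returns ties up only the worker thread; calls to `beat()` on the heartbeat thread keep returning.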

          Jonathan Hsieh made changes -
          Status Open [ 1 ] Patch Available [ 10000 ]
          Jonathan Hsieh made changes -
          Link This issue depends on FLUME-69 [ FLUME-69 ]
          Jonathan Hsieh added a comment -

          Added progress check to DFO clean close and it seems to work. Adding tests. This triggered a slew of other problems which at least for DFO seem to be cleaned up.

          Jonathan Hsieh made changes -
          Link This issue relates to FLUME-37 [ FLUME-37 ]
          Jonathan Hsieh added a comment -

          Here's the high level approach for a solution.

          Two parts:

          1. Push responsibility for opening and closing sources and sinks into the driver thread. The heartbeat thread will then likely just queue up the work it retrieves and hand it off to the logical node's driver thread, which executes it. Changing a config for that logical node may still hang or take a while, but the heartbeats will continue.

          2. Modify the clean close of DFODeco and WALDeco to time out if the subsink refuses to complete and is making no progress. This would potentially be deemed an unclean exit (requiring recovery of durable logs), and it would likely exit the driver thread by throwing an exception.
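
Part 2 above, a close that gives up only when the drain stops making progress, might look something like the sketch below. It is an assumption-laden illustration rather than the DFODeco/WALDeco code: `ProgressTimeoutClose`, `awaitDrain`, and the `isDrained` / `entriesSent` callbacks are hypothetical names, and polling a counter is just one way to detect a stall.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical helper: wait for a drain to finish, but give up if it stalls. */
final class ProgressTimeoutClose {
    private ProgressTimeoutClose() {}

    /**
     * @param isDrained   returns true once all durable entries have been flushed
     * @param entriesSent returns a counter that changes whenever an entry is flushed
     * @param stallMillis how long to tolerate no change in entriesSent
     * @param pollMillis  how often to re-check
     * @throws TimeoutException if no progress is observed for stallMillis
     */
    static void awaitDrain(BooleanSupplier isDrained, LongSupplier entriesSent,
                           long stallMillis, long pollMillis)
            throws InterruptedException, TimeoutException {
        long lastCount = entriesSent.getAsLong();
        long lastProgressNanos = System.nanoTime();
        while (!isDrained.getAsBoolean()) {
            Thread.sleep(pollMillis);
            long count = entriesSent.getAsLong();
            if (count != lastCount) {
                // Progress was made, so restart the stall clock.
                lastCount = count;
                lastProgressNanos = System.nanoTime();
            } else if (System.nanoTime() - lastProgressNanos >= stallMillis * 1_000_000L) {
                // The caller would treat this as an unclean exit and recover the durable logs later.
                throw new TimeoutException("no progress for " + stallMillis + " ms");
            }
        }
    }
}
```

The timeout here is measured from the last observed progress, not from the start of the close, so a slow but steadily draining subsink is still allowed to finish cleanly.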

          Jonathan Hsieh added a comment - - edited

          Here's why this happens.

          Currently, open and close calls on sinks and sources happen in the heartbeat thread. Thus, if open or close blocks or takes a long time, the heartbeat thread becomes blocked. So if a sink were set to be an rpcSink to a machine or port that wasn't up, and it were to retry on failures, the node would be blocked. To make this worse, if there are multiple logical nodes on a physical node and one logical node blocks like this, all the nodes get blocked.

          A previous patch addressed part of the problem by making open lazy, which effectively pushed the open call into the logical node's driver thread. This was great for the situations above: the open retries would happen in the logical node's driver thread.

          Unfortunately, blocking still happens if close takes a long time to complete. The two common cases are DFO and WAL, which currently have semantics where any durable entries are flushed before close completes. When coupled with a sink that "never" fails, this means the DFO/WAL will never appear closed. The close is therefore effectively blocked, which prevents new changes from going in.

          Jonathan Hsieh made changes -
          Fix Version/s v0.9.1 [ 10013 ]
          Jonathan Hsieh made changes -
          Link This issue relates to FLUME-67 [ FLUME-67 ]
          Patrick Hunt made changes -
          Workflow jira [ 10122 ] flume-workflow [ 10195 ]
          Status In Progress [ 3 ] Open [ 1 ]
          Jonathan Hsieh made changes -
          Priority Major [ 3 ] Blocker [ 1 ]
          Jonathan Hsieh made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Jonathan Hsieh made changes -
          Field Original Value New Value
          Assignee Jonathan Hsieh [ jmhsieh ]
          Jonathan Hsieh created issue -

            People

            • Assignee:
              Jonathan Hsieh
              Reporter:
              Jonathan Hsieh
            • Votes:
              0
              Watchers:
              0
