Flume
  1. Flume
  2. FLUME-711

Reconfiguring nodes while transfering data causes unrecoverable failure

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Blocker Blocker
    • Resolution: Unresolved
    • Affects Version/s: v0.9.3
    • Fix Version/s: v0.9.5
    • Component/s: Node, Sinks+Sources
    • Labels:
      None

      Description

      Reconfiguring a collector node that is transferring data causes it to enter a state where it becomes unresponsive to new data but doesn't seem to fail.

        Activity

        Hide
        fang fang chen added a comment -

        Is this an issue which can be reproduced stably?
        In flume-0.9.4, data transferring sometime will pause with:
        [Roll-TriggerThread-0] WARN conf.FlumeBuilder: Deprecated syntax: Expected a format spec but instead had a (String) avrojson. But I did not encounter the error in Attachments.

        Show
        fang fang chen added a comment - Is this an issue which can be reproduced stably? In flume-0.9.4, data transferring sometime will pause with: [Roll-TriggerThread-0] WARN conf.FlumeBuilder: Deprecated syntax: Expected a format spec but instead had a (String) avrojson. But I did not encounter the error in Attachments.
        Hide
        Prasad Mujumdar added a comment -

        FLUME-927 is addresses an issue related blocked flow during collector restart/reconfigure. could be related

        Show
        Prasad Mujumdar added a comment - FLUME-927 is addresses an issue related blocked flow during collector restart/reconfigure. could be related
        Hide
        E. Sammer added a comment -

        I've unassigned this from myself. I'm not going to be able to work on it.

        Show
        E. Sammer added a comment - I've unassigned this from myself. I'm not going to be able to work on it.
        E. Sammer made changes -
        Assignee E. Sammer [ esammer ]
        Hide
        Sergey Tryuber added a comment -

        I've also encountered this issue when exec() source already worked, but not all data has been sent from agent.

        Show
        Sergey Tryuber added a comment - I've also encountered this issue when exec() source already worked, but not all data has been sent from agent.
        Mark Thomas made changes -
        Project Import Tue Aug 02 16:57:12 UTC 2011 [ 1312304232406 ]
        Hide
        E. Sammer added a comment - - edited

        The full details:

        Config:

        n1 : execStream("tail -F datafile") | agentE2ESink("10.0.3.227", 12345);
        n2 : collectorSource(12345) | collectorSink("file://...", "n2-");
        map 10.0.3.227 n1
        map 10.0.3.227 n2
        

        Data generator:

        perl -e 'while (1) { print $i++, "\n"; }' >> datafile
        

        Steps:
        1. See that data transfers correctly; collector is rolling, agent receiving ACKs, WAL functional.
        2. Re-execute the exact same n2 config (not the mapping, just the n2 config).
        3. Note how flume is unable to resume data transfer.

        • E2E WAL dirs just fill endlessly.

        Noteworthy log points:

        mast 14:03:42,946 - configs loaded from ZK.
        mast 14:03:43,144 - configs loaded from ZK (not sure why it happended a second time. translation?)
        mast 14:03:51,534 - n1 heartbeat - active
        mast 14:04:21,584 - n2 heartbeat - active (ignore the previous open / close for n2 - that's FLUME-706)
        mast 14:08:44,505 - command to reconfig n2 received
        mast 14:08:46,394 - master receives a CancelledKeyException while talking to ZK. Not sure this matters.
        mast 14:08:47,526 - configs loaded from ZK (probably due to CancelledKeyException).
        node 14:08:47,936 - the collector (logicalNode n2-44) close is requested.
        node 14:08:48,940 - n2-44 still has 1000 elements.
        mast 14:08:52,896 - n2 heartbeat - closing
        node 14:08:58,950 - n2-44 times out while closing due to no progress on close.
        node 14:08:59,078 - n2-44 driver completes. n2-44 no longer exists.
        node 14:08:59,094 - n2-64 comes into existence - a new instance of the collector.
        mast 14:09:02,965 - n2 heartbeat - active (presumably for n2-64)
        node              - note that there are *no more appearances* of n2 from this point!
        mast              - n2 continues to heartbeat normally.
        node 14:10:48,497 - retransmission of an outstanding ACK group. stale: 60332ms. the first of many.
        node 14:10:48,498 - race between SENT vs. SENDING message. Possibly unrelated. Correlated with retrans of ACK above.
        node 14:32:18,391 - n1 hits an error appending an event. Root cause: "Blocked append interrupted by rotate."
        node 14:32:18,393 - "Input stream pipe closed" from (appears to be related to n1 failure.)
        node 14:32:18,397 - n1 error while waiting for exec threads to exit.
        node 14:32:18,992 - n1 attempts to ensure WAL thread is closed.
        node              - n1 - tons of errors about "expected IDLE but timed out in state ACTIVE." followed by...
        node              - n1 - endless "Attempt X with no progress on wal consumer subsink" (These two alternate.)
        node              - Roll, heartbeat, checkconfig threads continue.
        node 14:45:01,719 - Node shutdown.
        mast 14:45:05,585 - master stopped. No exceptions, no node state changes in heartbeats.
        
        Show
        E. Sammer added a comment - - edited The full details: Config: n1 : execStream( "tail -F datafile" ) | agentE2ESink( "10.0.3.227" , 12345); n2 : collectorSource(12345) | collectorSink( "file: //..." , "n2-" ); map 10.0.3.227 n1 map 10.0.3.227 n2 Data generator: perl -e ' while (1) { print $i++, "\n" ; }' >> datafile Steps: 1. See that data transfers correctly; collector is rolling, agent receiving ACKs, WAL functional. 2. Re-execute the exact same n2 config (not the mapping, just the n2 config). 3. Note how flume is unable to resume data transfer. E2E WAL dirs just fill endlessly. Noteworthy log points: mast 14:03:42,946 - configs loaded from ZK. mast 14:03:43,144 - configs loaded from ZK (not sure why it happended a second time. translation?) mast 14:03:51,534 - n1 heartbeat - active mast 14:04:21,584 - n2 heartbeat - active (ignore the previous open / close for n2 - that's FLUME-706) mast 14:08:44,505 - command to reconfig n2 received mast 14:08:46,394 - master receives a CancelledKeyException while talking to ZK. Not sure this matters. mast 14:08:47,526 - configs loaded from ZK (probably due to CancelledKeyException). node 14:08:47,936 - the collector (logicalNode n2-44) close is requested. node 14:08:48,940 - n2-44 still has 1000 elements. mast 14:08:52,896 - n2 heartbeat - closing node 14:08:58,950 - n2-44 times out while closing due to no progress on close. node 14:08:59,078 - n2-44 driver completes. n2-44 no longer exists. node 14:08:59,094 - n2-64 comes into existence - a new instance of the collector. mast 14:09:02,965 - n2 heartbeat - active (presumably for n2-64) node - note that there are *no more appearances* of n2 from this point! mast - n2 continues to heartbeat normally. node 14:10:48,497 - retransmission of an outstanding ACK group. stale: 60332ms. the first of many. node 14:10:48,498 - race between SENT vs. SENDING message. Possibly unrelated. Correlated with retrans of ACK above. node 14:32:18,391 - n1 hits an error appending an event. Root cause: "Blocked append interrupted by rotate." node 14:32:18,393 - "Input stream pipe closed" from (appears to be related to n1 failure.) node 14:32:18,397 - n1 error while waiting for exec threads to exit. node 14:32:18,992 - n1 attempts to ensure WAL thread is closed. node - n1 - tons of errors about "expected IDLE but timed out in state ACTIVE." followed by... node - n1 - endless "Attempt X with no progress on wal consumer subsink" (These two alternate.) node - Roll, heartbeat, checkconfig threads continue . node 14:45:01,719 - Node shutdown. mast 14:45:05,585 - master stopped. No exceptions, no node state changes in heartbeats.
        E. Sammer made changes -
        Field Original Value New Value
        Attachment collector_restart_case.tar.gz [ 10743 ]
        Hide
        E. Sammer added a comment -

        Attaching tarball containing logs, the configuration used, and netstat -nl output.

        Show
        E. Sammer added a comment - Attaching tarball containing logs, the configuration used, and netstat -nl output.
        E. Sammer created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            E. Sammer
          • Votes:
            3 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:

              Development