
FLUME-706: Flume nodes launch duplicate logical nodes

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: v0.9.5
    • Fix Version/s: v0.9.5
    • Component/s: Master, Node
    • Labels:
      None

      Description

      When submitting a config command to the Flume master, it seems as if the downstream node attempts to load the config twice.

      In a test case, starting a single master and a single node, I submitted a "config node rpcSource(12345) console". The node sees the config change on the next heartbeat, updates its config, and starts the Thrift source on port 12345. Immediately after, it logs "Taking another heartbeat" (DEBUG) and attempts to create another logical node with the same config. This leads to Thrift errors in bind() and "Could not create ServerSocket on address ...". Looking at the root cause in a debugger (Thrift swallows the original exception) I can see it's an "Address already in use" IOException.
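      For reference, the bind failure described above can be reproduced outside Flume with two plain server sockets on the same port. This is a minimal, hypothetical sketch (port number arbitrary), not Flume or Thrift code:

      import java.io.IOException;
      import java.net.ServerSocket;

      public class DoubleBindDemo {
          public static void main(String[] args) throws IOException {
              // First bind succeeds, just like the first logical node opening its source.
              ServerSocket first = new ServerSocket(12345);
              try {
                  // Second bind on the same port fails with java.net.BindException:
                  // "Address already in use" -- the error Thrift wraps and swallows.
                  ServerSocket second = new ServerSocket(12345);
                  second.close();
              } catch (IOException e) {
                  System.err.println("Second bind failed: " + e);
              } finally {
                  first.close();
              }
          }
      }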


          Activity

          Jonathan Hsieh added a comment -

          Committed

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1467/#review1637
          -----------------------------------------------------------

          no review for over a week. committing.

          • jmhsieh

          On 2011-08-17 19:55:38, jmhsieh wrote:

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1467/
          -----------------------------------------------------------

          (Updated 2011-08-17 19:55:38)

          Review request for Flume, Arvind Prabhakar and Eric Sammer.

          Summary
          -------

          commit 34b0ada18f38d82b8acee4c2ec1a5b6693e524ea
          Author: Jonathan Hsieh <jmhsieh@apache.org>
          Date: Wed Aug 17 10:35:22 2011 -0700

          FLUME-706: Flume nodes launch duplicate logical nodes

          When a logical node is being spawned for the first time we attempt to load the config of the node. Unfortunately, we would subsequently load it
          again and spawn a second driver thread because we neglected to update the last good config version. This fixes the problem by making sure that
          value gets updated on the first attempt. We also update error handling so that a failure of a single logical node spawn only affects that node.

          This addresses bug flume-706.
          https://issues.apache.org/jira/browse/flume-706

          Diffs
          -----

          flume-core/src/main/java/com/cloudera/flume/agent/FlumeNode.java b8f2b67
          flume-core/src/main/java/com/cloudera/flume/agent/LivenessManager.java c72a626
          flume-core/src/main/java/com/cloudera/flume/agent/LogicalNode.java 3f64238
          flume-core/src/main/java/com/cloudera/flume/agent/LogicalNodeManager.java b3f96f2
          flume-core/src/main/java/com/cloudera/flume/conf/FlumeConfigData.java 9e660cc
          flume-core/src/test/java/com/cloudera/flume/agent/TestAgentCloseNoDeadlock.java e1353b8
          flume-core/src/test/java/com/cloudera/flume/agent/TestLogicalNodeManager.java 0fd4bc6
          flume-core/src/test/java/com/cloudera/flume/agent/diskfailover/TestDiskFailoverBehavior.java 831eca3
          flume-core/src/test/java/com/cloudera/flume/shell/TestFlumeShell.java f81b190

          Diff: https://reviews.apache.org/r/1467/diff

          Testing
          -------

          Added new test, it passes. Currently running full test suite.

          Thanks,

          jmhsieh

          Jonathan Hsieh added a comment -

          Updated patch includes test cases and cleans up some exception handling.

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1467/
          -----------------------------------------------------------

          (Updated 2011-08-17 19:55:38.022879)

          Review request for Flume, Arvind Prabhakar and Eric Sammer.

          Changes
          -------

          Updated to clean up exception handling when a spawn fails, and with a real unit test that tests the root problem. Look at the diff between #1 and #2 to see the improved exception handling and the added test case.

          Summary (updated)
          -------

          commit 34b0ada18f38d82b8acee4c2ec1a5b6693e524ea
          Author: Jonathan Hsieh <jmhsieh@apache.org>
          Date: Wed Aug 17 10:35:22 2011 -0700

          FLUME-706: Flume nodes launch duplicate logical nodes

          When a logical node is being spawned for the first time we attempt to load the config of the node. Unfortunately, we would subsequently load it
          again and spawn a second driver thread because we neglected to update the last good config version. This fixes the problem by making sure that
          value gets updated on the first attempt. We also update error handling so that a failure of a single logical node spawn only affects that node.

          This addresses bug flume-706.
          https://issues.apache.org/jira/browse/flume-706

          Diffs (updated)


          flume-core/src/main/java/com/cloudera/flume/agent/FlumeNode.java b8f2b67
          flume-core/src/main/java/com/cloudera/flume/agent/LivenessManager.java c72a626
          flume-core/src/main/java/com/cloudera/flume/agent/LogicalNode.java 3f64238
          flume-core/src/main/java/com/cloudera/flume/agent/LogicalNodeManager.java b3f96f2
          flume-core/src/main/java/com/cloudera/flume/conf/FlumeConfigData.java 9e660cc
          flume-core/src/test/java/com/cloudera/flume/agent/TestAgentCloseNoDeadlock.java e1353b8
          flume-core/src/test/java/com/cloudera/flume/agent/TestLogicalNodeManager.java 0fd4bc6
          flume-core/src/test/java/com/cloudera/flume/agent/diskfailover/TestDiskFailoverBehavior.java 831eca3
          flume-core/src/test/java/com/cloudera/flume/shell/TestFlumeShell.java f81b190

          Diff: https://reviews.apache.org/r/1467/diff

          Testing (updated)
          -------

          Added new test, it passes. Currently running full test suite.

          Thanks,

          jmhsieh

          Jonathan Hsieh added a comment -

          @Satish

          Just noticed your comment. I think we picked a similar approach to fix the problem. Cool!

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1467/
          -----------------------------------------------------------

          Review request for Flume.

          Summary
          -------

          rough draft of a different approach to tackling FLUME-706

          This addresses bug flume-706.
          https://issues.apache.org/jira/browse/flume-706

          Diffs


          flume-core/src/main/java/com/cloudera/flume/agent/FlumeNode.java b8f2b67
          flume-core/src/main/java/com/cloudera/flume/agent/LivenessManager.java c72a626
          flume-core/src/main/java/com/cloudera/flume/agent/LogicalNode.java 3f64238
          flume-core/src/main/java/com/cloudera/flume/agent/LogicalNodeManager.java b3f96f2
          flume-core/src/main/java/com/cloudera/flume/conf/FlumeConfigData.java 9e660cc
          flume-core/src/test/java/com/cloudera/flume/agent/TestAgentCloseNoDeadlock.java e1353b8
          flume-core/src/test/java/com/cloudera/flume/agent/TestLogicalNodeManager.java 0fd4bc6
          flume-core/src/test/java/com/cloudera/flume/agent/diskfailover/TestDiskFailoverBehavior.java 831eca3

          Diff: https://reviews.apache.org/r/1467/diff

          Testing
          -------

          running tests currently, need to add tests and cleanup still.

          Thanks,

          jmhsieh

          Jonathan Hsieh added a comment -

          Eric's logs and initial logs were extremely helpful digging into this.

          I've attached a cut of my attempt to fix the problem based on the threading analysis in the previous comment. This patch passes some of the manual testing, and I believe the fixes make sense.

          Caveat: it is not polished yet, and I have not written new tests to check for this error condition; I have just kicked off a job to run the full test suite to make sure there are no new regressions.

          Jonathan Hsieh added a comment - edited

          Note: this seemed flaky because it will only ever occur on a mapped logical node. This doesn't affect the default node, because it is always present and has a correct FlumeConfigData config and version. This will probably not happen either if a node is mapped first and then config'ed after the node has been spawned.

          Jonathan Hsieh added a comment -

          Ok, I think I have a sequence of events that creates the bug. There is an interaction between three threads that causes this. CCT is the Check Config Thread (gets config data from the queue), HBT is the heartbeat thread (checks for spawning, checks for configs and enqueues them for CCT, and other stuff not relevant to this), and PT1 and PT2 are pumper/driver threads. In this situation there is only one FlumeConfigData (FCD) passed, containing the ThriftSource spec and some arbitrary Sink spec.

          CCT starts, blocks on empty queue
          HBT makes rpcs to master to heartbeat
          HBT calls checkLogicalNode, learns about new logicalnode
          HBT realizes logicalnode is new,
          HBT gets FCD (thriftSource, sink),
          HBT "spawns" the node by starting PT with FCD info. // (this does not update the config version number)
          PT starts, calling thriftSource.open, sink.open
          PT enters thriftSource.append loop (shipping data from source to sink)
          HBT checkLogicalNodeConfigs
          HBT notices that it needs a new config
          HBT fetches and enqueues FCD // (this actually has already been fetched by checkLogicalNode step)
          CCT unblocks dequeuing a flumeConfigData(FCD)
          CCT calls logicalnode's checkConfig(FCD)
          CCT believes the last good FCD is (nullSource, nullSink at unixtime 0) // (this is because we didn't update the version number earlier)
          CCT attempts to load the FCD because it thinks it is new.
          CCT instantiates new instances of source and sink.
          CCT attempts to nicely shut down the previous PT (via stop call on driver thread).
          CCT attempts to join on PT
          CCT times out on join and then issues a thread cancel (depending on where PT is, it may not catch this interruption)
          CCT attempts to start new direct driver thread (PT2)
          PT2's open attempt fails because network port already bound (doesn't throw exception)
          PT1 reaches a close call (doesn't close right away because queue is full of stuff)
          CCT finally sets last good config.

          The ThriftSource is in a closing state, and neither PT1 nor PT2 is functioning properly.

          There are a few paper cuts, but a quick fix seems to be to set the version properly during the spawn so that the second call that gets the same FCD isn't initiated.

          Another, likely more robust, approach is to make spawn happen in the single CCT thread instead of the HBT.
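          To illustrate the quick fix described above, here is a hypothetical sketch of the version-guard idea (field and method names are illustrative only, not the real LogicalNode API): record the config version as soon as the spawn applies the FCD, so the identical FCD dequeued later is recognized as already loaded and skipped.

          // Hypothetical sketch; not the actual Flume code.
          class LogicalNodeSketch {
              private long lastGoodConfigVersion = 0; // the "unixtime 0" default

              // Called by both the spawn path (HBT) and the check-config path (CCT).
              synchronized void checkConfig(long configVersion, Runnable loadDriver) {
                  if (configVersion <= lastGoodConfigVersion) {
                      // Same (or older) FCD was already applied during spawn,
                      // so a second driver thread is never started.
                      return;
                  }
                  loadDriver.run();                      // open source/sink, start driver thread
                  lastGoodConfigVersion = configVersion; // update immediately on the first attempt
              }
          }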

          satish added a comment -

          I was able to fix this: I replaced the call to loadNodeDriver() in LogicalNodeManager.spawn with nd.checkConfig(data), and I also overloaded NodesManager.spawn() to NodesManager.spawn(String name, FlumeConfigData data) with the above call to checkConfig. The first data flow (caused by checkLogicalNodes() in the HeartBeatThread) results in lastGoodConfig being set to a valid config (since checkConfig eventually calls loadConfig), so when the second data flow, started from dequeueCheckConfig, reaches checkConfig, the config is already valid and set, and hence it does not try to start the driver again.

          I understand that this might be a temporary fix; clearly Eric's approach of separating the config handling into a single place in the code is better, but for immediate testing this is a smaller/faster fix, right?

          Jonathan Hsieh added a comment -

          This bug seems intermittent.

          A couple of weeks ago I could not duplicate this bug.

          I was able to duplicate it today several times in a row, but can't seem to duplicate it anymore.

          For other reasons, I upped my dev VM from 1 CPU to 4 CPUs and now am unable to recreate it.

          Jonathan Hsieh added a comment -

          This bug is caused by a regression from the FLUME-595, FLUME-596, FLUME-569 group of patches.

          E. Sammer added a comment -

          The rough version of what I did as an emergency fix for a specific user is now up at https://review.cloudera.org/r/1896/

          E. Sammer added a comment -

          The NPE isn't critical here, and the interruption and binding issues are a side effect of the command being executed twice. The binding problem is the result of the port already being in use (because a previous incarnation of the source / sink is still running).

          Jonathan Hsieh added a comment -

          I was able to duplicate the NPE in the FLUME-706.log file. I did this by putting a breakpoint at the beginning of the run method in DirectDriver. A driver thread can be spawned but not have opened the source yet. When a report is requested, the server field within the driver thread is still null and causes the NPE. Working on finding the interleaving that causes the interrupted exception problem, which seems to cause the follow-on binding problem.
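          A minimal sketch of the kind of guard that avoids that NPE (hypothetical names, not the actual DirectDriver code): treat a driver whose server has not been created yet as "not yet open" when building a report.

          // Hypothetical sketch; "server" and "getReport" are illustrative names only.
          class DriverReportSketch {
              private volatile Object server; // set inside run() after the source is opened

              String getReport() {
                  Object s = server;
                  if (s == null) {
                      // Driver thread spawned but source not opened yet: report that
                      // instead of dereferencing the null field.
                      return "driver starting (source not yet open)";
                  }
                  return "server: " + s.toString();
              }
          }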

          E. Sammer added a comment - edited

          I tried this again with Flume built from master, using no custom code or tests, to ensure it wasn't just something I was doing.

          [flume localhost:35873:45678] getconfigs
          NODE    FLOW            SOURCE                  SINK                            
          n1      default-flow    execPeriodic("ls", 500) agentBESink("10.0.3.227", 12345)
          n2      default-flow    collectorSource(12345)  text("n2.out")  
          

          I mapped both n1 and n2 to 10.0.3.227.

          Here's a snippet of the logs (master and node running in two processes but with intermingled output) that shows that mapping n1 to 10.0.3.227 causes the node to open the n1 source / sink (logical node 547), immediately close it, and open it again under a different logical node thread (logical node 553).

          2011-07-18 16:23:58,889 [exec-thread] INFO master.CommandManager: Executing command: map [10.0.3.227, n1]
          2011-07-18 16:23:58,896 [main-EventThread] INFO master.ZooKeeperConfigStore: Nodemaps were updated - reloading
          2011-07-18 16:23:58,900 [main-EventThread] INFO master.ZooKeeperConfigStore: Loading nodes from: /flume-nodes/nodes0000000010
          2011-07-18 16:23:59,991 [Heartbeat] INFO agent.LivenessManager: check logicalNode:10.0.3.227
          2011-07-18 16:23:59,991 [Heartbeat] INFO agent.LivenessManager: check logicalNode:n2
          2011-07-18 16:23:59,991 [Heartbeat] INFO agent.LivenessManager: check logicalNode:n1
          2011-07-18 16:23:59,994 [Heartbeat] INFO agent.LogicalNodeManager: creating new logical node n1
          2011-07-18 16:24:00,029 [logicalNode n1-547] INFO thrift.ThriftEventSink: ThriftEventSink to 10.0.3.227:12345 opened
          2011-07-18 16:24:00,029 [logicalNode n1-547] INFO debug.InsistentOpenDecorator: Opened StubbornAppend on try 0
          2011-07-18 16:24:00,030 [logicalNode n1-547] INFO thrift.ThriftEventSink: ThriftEventSink on port 12345 closed
          2011-07-18 16:24:00,032 [Thread-225] INFO util.InputStreamPipe: Input stream pipe interrupted
          2011-07-18 16:24:00,032 [Thread-226] INFO util.InputStreamPipe: Input stream pipe interrupted
          2011-07-18 16:24:00,039 [Check config] INFO agent.LogicalNode: Node config successfully set to com.cloudera.flume.conf.FlumeConfigData@5b8b0b4a
          2011-07-18 16:24:00,068 [logicalNode n1-553] INFO thrift.ThriftEventSink: ThriftEventSink to 10.0.3.227:12345 opened
          2011-07-18 16:24:00,068 [logicalNode n1-553] INFO debug.InsistentOpenDecorator: Opened StubbornAppend on try 0
          10.0.3.227 [INFO Mon Jul 18 16:24:00 PDT 2011] { execcmd : STDOUT } { procsource : ls } { service : 1702389091 'exec' } CHANGELOG.html\nDEVNOTES\nLICENSE\nNOTICE\nREADME\nRELEASENOTES\nbin\nconf\ncontrib\ndocs\nlib\nn2.out\nwebapps\n
          
          E. Sammer added a comment -

          Attached log showing the behavior described. The code that triggered this was:

          // Start an in-process node and master.
          node.start();
          master.serve();

          // Give the node time to heartbeat against the master.
          Thread.sleep(5000);

          // Configure the physical node with an rpcSource(12346) source and a console sink.
          master.submit(new Command("config", "10.0.3.227", "rpcSource(12346)", "console"));

          // Let the config propagate on the next heartbeat before stopping the node.
          Thread.sleep(5000);

          node.stop();
          

          All directories were wiped out prior to running the test.


            People

            • Assignee: Jonathan Hsieh
            • Reporter: E. Sammer
            • Votes: 0
            • Watchers: 1
