  1. Flume
  2. FLUME-649

Flume loses events and blocks when (maybe) the Thrift RPC sink is too slow



    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 0.9.3
    • Fix Version/s: 0.9.5
    • Component/s: Node
    • Environment:

      Ubuntu 10.04/10.10, 8GB RAM, ~ 750GB Disc, thrift 0.5


      We're using an rpcSource which has an rpcSink like this one:

      < rpcSink( "rpcserver", 9090 ) ? { diskFailover => { insistentAppend => { stubbornAppend => { insistentOpen => rpcSink( "rpcserver", 9090 ) } } } } >

      When many Flume nodes write to this "rpcserver" in parallel and the rpcserver cannot handle the incoming events as fast as they arrive, the network buffers fill up, and tcpdump/wireshark shows "TCP WindowFull" (see http://wiki.wireshark.org/TCP_Analyze_Sequence_Numbers). The Flume node does not detect this quickly, and two problems appear:

      1. The Flume node seems to keep sending for some time to the server whose buffers are full; it takes a while until it closes the connection, and some events are lost.
      2. Before the Flume node re-establishes the connection, as in this log excerpt:
      2011-04-23 23:34:20,940 INFO com.cloudera.flume.handlers.debug.StubbornAppendSink: Append failed java.net.SocketException: Broken pipe
      2011-04-23 23:34:20,940 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 9090 closed
      2011-04-23 23:34:20,940 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink open on port 9090 opened
      2011-04-23 23:34:20,940 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: Opened ThriftEventSink on try 0

      it takes much longer to accept events, so our RPC clients connected to the Flume node instance see many timeouts (we need fast responses).
      So maybe we are using Flume incorrectly, or the mechanism does not queue events but tries to send them directly through the pipe, which is not possible because of the
      slower RPC server. This blocking makes it unusable for us. Did we do something wrong, or is this a Flume bug?
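
      To make the expectation concrete, here is a minimal sketch of the queueing behaviour we had in mind: the receiving side hands events to a bounded in-memory buffer and never blocks on the slow sink. This is only an illustration under our own assumptions. `BoundedHandoff`, `accept`, `nextForSink` and `pending` are hypothetical names and not part of Flume, and a real setup would spill to disk rather than reject events when the buffer is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical helper (not a Flume class): decouples a fast receiver from a slow sender. */
final class BoundedHandoff {
    private final BlockingQueue<String> queue;

    BoundedHandoff(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Receiver side: returns false at once when the buffer is full instead of blocking the caller. */
    boolean accept(String event) {
        return queue.offer(event);
    }

    /** Sender side: takes the oldest buffered event, or returns null if nothing is waiting. */
    String nextForSink() {
        return queue.poll();
    }

    /** Number of events accepted but not yet handed to the sink. */
    int pending() {
        return queue.size();
    }
}
```

      With something like this, the RPC clients would get an immediate answer (accepted or rejected) even while the downstream sink is stalled, and the sender thread could drain the buffer at whatever rate the rpcserver allows.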





            • Assignee:
              flume_se Disabled imported user


              • Created: