FLUME-286

DFO mode does not detect network failure


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.9.1
    • Fix Version/s: 0.9.2
    • Component/s: Sinks+Sources
    • Labels: None

    Description

      Collector configured as:

      exec config auctionlogsink 'collectorSource(35853)' '{ gunzip => collectorSink( "hdfs://clmaster01/bidder_data/raw/auction_logs/%Y%m%d/%H/", "auctionLog-", 300000 ) }'

      Agent configured as:

      exec config nym7-bidlog 'syslogTcp(5140)' '{ gzip => agentDFOSink( "clmaster01", 35853 ) }'

      We first observed this problem in production when our collector server went down. I've since observed it in a test environment too. If you simply stop the collector process, the agent immediately notices and starts writing events to disk:

      2010-10-19 17:40:09,549 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: open attempt 0 failed, backoff (1000ms): Failed to open thrift event sink at 192.168.1.43:35855 : java.net.ConnectException: Connection refused

      However, in the event of a network failure (or any failure where the machine stops responding entirely, as happened in our production scenario), which I simulated by pulling the Ethernet cable out of the machine, the agent node continues as if nothing has gone wrong.

      In my test scenario, when I plugged the cable back in, some of the events were received, presumably because they had been sitting in a TCP buffer. At no point, however, did the agent detect the situation, write anything to disk, or attempt to re-transmit.
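
      The behaviour described above is consistent with how TCP sockets behave when a peer silently disappears: a write only copies bytes into the kernel send buffer, so it keeps "succeeding" until the OS retransmission timer eventually gives up, which can take many minutes. The sketch below is a plain-Java illustration of this, not Flume code; the host name, port, and timeout values are placeholders chosen for the example. It also shows why prompt detection generally needs either keepalive tuning or an application-level acknowledgement read with a bounded timeout.

      import java.io.OutputStream;
      import java.net.InetSocketAddress;
      import java.net.Socket;
      import java.net.SocketTimeoutException;

      // Hypothetical illustration (not Flume code): why a pulled cable goes unnoticed.
      public class SilentFailureDemo {
        public static void main(String[] args) throws Exception {
          Socket s = new Socket();
          s.connect(new InetSocketAddress("collector.example.com", 35853), 5000);
          s.setSoTimeout(10000);   // read timeout only; writes are unaffected
          s.setKeepAlive(true);    // OS-level probes, but default intervals are ~2 hours

          OutputStream out = s.getOutputStream();
          byte[] event = new byte[1024];
          out.write(event);        // still "succeeds" after the cable is pulled:
          out.flush();             // the bytes just sit in the kernel send buffer

          // Detecting the failure promptly requires some reply from the peer,
          // e.g. an application-level ACK read under SO_TIMEOUT:
          try {
            int ack = s.getInputStream().read();   // blocks at most 10 seconds
            if (ack == -1) {
              System.err.println("peer closed connection");
            }
          } catch (SocketTimeoutException e) {
            System.err.println("no ACK within 10s -- treat as failure and fail over to disk");
          }
          s.close();
        }
      }

      Note that Java offers no write timeout on a TCP socket, so a one-way event stream over a dead link cannot observe the failure on its own; whatever fix is applied would have to introduce acknowledgements or heartbeats at the Thrift/application layer, or rely on much more aggressive OS keepalive settings.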

            People

              Assignee: Jonathan Hsieh (jmhsieh)
              Reporter: flume_jamesg (disabled imported user)
