Uploaded image for project: 'Flume'
  1. Flume
  2. FLUME-2451

HDFS Sink Cannot Reconnect After NameNode Restart

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Patch Available
    • Major
    • Resolution: Unresolved
    • 1.4.0
    • None
    • File Channel, Sinks+Sources
    • 8 node CDH 4.2.2 (2.0.0-cdh4.2.2) cluster
      All cluster machines are running Ubuntu 12.04 x86_64

    Description

      I am testing a simple flume setup with a Sequence Generator Source, a File Channel, and an HDFS Sink (see my flume.conf below). This configuration works as expected until I reboot the cluster’s NameNode or until I restart the HDFS service on the cluster. At this point, it appears that the Flume Agent cannot reconnect to HDFS and must be manually restarted.

      Here is our flume.conf:

      appserver.sources = rawtext
      appserver.channels = testchannel
      appserver.sinks = test_sink

      appserver.sources.rawtext.type = seq
      appserver.sources.rawtext.channels = testchannel

      appserver.channels.testchannel.type = file
      appserver.channels.testchannel.capacity = 10000000
      appserver.channels.testchannel.minimumRequiredSpace = 214748364800
      appserver.channels.testchannel.checkpointDir = /Users/aoneill/Desktop/testchannel/checkpoint
      appserver.channels.testchannel.dataDirs = /Users/aoneill/Desktop/testchannel/data
      appserver.channels.testchannel.maxFileSize = 20000000

      appserver.sinks.test_sink.type = hdfs
      appserver.sinks.test_sink.channel = testchannel
      appserver.sinks.test_sink.hdfs.path = hdfs://cluster01:8020/user/aoneill/flumetest
      appserver.sinks.test_sink.hdfs.closeTries = 3
      appserver.sinks.test_sink.hdfs.filePrefix = events-
      appserver.sinks.test_sink.hdfs.fileSuffix = .avro
      appserver.sinks.test_sink.hdfs.fileType = DataStream
      appserver.sinks.test_sink.hdfs.writeFormat = Text
      appserver.sinks.test_sink.hdfs.inUsePrefix = inuse-
      appserver.sinks.test_sink.hdfs.inUseSuffix = .avro
      appserver.sinks.test_sink.hdfs.rollCount = 100000
      appserver.sinks.test_sink.hdfs.rollInterval = 30
      appserver.sinks.test_sink.hdfs.rollSize = 10485760

      These are the two error message that the Flume Agent outputs constantly after the restart:

      2014-08-26 10:47:24,572 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.AbstractHDFSWriter.isUnderReplicated(AbstractHDFSWriter.java:96)] Unexpected error while checking replication factor
      java.lang.reflect.InvocationTargetException
      at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at org.apache.flume.sink.hdfs.AbstractHDFSWriter.getNumCurrentReplicas(AbstractHDFSWriter.java:162)
      at org.apache.flume.sink.hdfs.AbstractHDFSWriter.isUnderReplicated(AbstractHDFSWriter.java:82)
      at org.apache.flume.sink.hdfs.BucketWriter.shouldRotate(BucketWriter.java:452)
      at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:387)
      at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:392)
      at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
      at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
      at java.lang.Thread.run(Thread.java:744)
      Caused by: java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
      at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:525)
      at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1253)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:891)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:881)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:982)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:779)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)

      and

      2014-08-26 10:47:29,592 (SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:418)] HDFS IO error
      java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
      at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:525)
      at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1253)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:891)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:881)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:982)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:779)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)

      I can provide additional information if needed. Thank you very much for any insight you are able to provide into this problem.

      Attachments

        1. FLUME-2451.v3.patch
          9 kB
          Gareth Davis
        2. FLUME-2451.v2.patch
          6 kB
          Roshan Naik
        3. FLUME-2451.patch
          2 kB
          Roshan Naik

        Issue Links

          Activity

            People

              venkatnrangan Venkat Ranganathan
              AndroidOatmeal Andrew O'Neill
              Votes:
              4 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated: