Flume / FLUME-1654

Event always stay at Channel even if HDFS Sink's file has been removed

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: v1.2.0, v1.3.0, v1.4.0
    • Fix Version/s: None
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      Data flow environment: a single HDFS Sink consumes events from a Channel. The Sink's destination file was deleted by mistake. The Sink's 'append' attempts fail on retry, and it cannot close the file either, so the failure repeats indefinitely at the Sink. There is no other way to consume events from the Channel (a single Channel is mapped to a single Sink).
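The failure mode described above can be illustrated with a minimal sketch (this is not Flume source code; the channel/transaction semantics are simplified to a deque with peek-on-take and poll-on-commit):

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of why events pile up: the sink takes an event in a
// transaction, append() fails, the transaction rolls back, and the same
// event is retried forever while the channel never drains.
class StuckSinkDemo {
    static int drainAttempts(Deque<String> channel, int maxAttempts) {
        int attempts = 0;
        while (!channel.isEmpty() && attempts < maxAttempts) {
            attempts++;
            String event = channel.peek();          // begin "transaction"
            try {
                append(event);                      // always fails: file gone
                channel.poll();                     // commit: remove event
            } catch (IOException e) {
                // rollback: event stays at the head of the channel
            }
        }
        return attempts;
    }

    static void append(String event) throws IOException {
        // Stand-in for the HDFS append against a deleted destination file.
        throw new IOException("Stale file handle");
    }
}
```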

        Activity

        Denny Ye added a comment -

        This problem is caused by a FileNotFoundException from the NameNode. When the exception occurs (the close method also fails), there is no chance to set the 'isOpen' status to false.

        In my opinion, each subclass of HDFSWriter should provide an 'exists' method to check the file's status with the NameNode. If the file has been deleted or moved, we can reopen a new target file to avoid event accumulation in the Channel.

        I have modified the code and tested this approach; it works well.
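The proposed exists-then-reopen pattern might look like the sketch below. This is a hypothetical illustration, not the attached patch: it uses java.nio.file against the local filesystem as a stand-in for HDFS, where real code would call FileSystem.exists(path) against the NameNode from each HDFSWriter subclass, and the ".reopened" naming is invented for the demo.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of the proposed 'exists' check, using the local
// filesystem in place of HDFS.
class ReopeningWriter {
    private Path target;
    private boolean isOpen;

    ReopeningWriter(Path target) throws IOException {
        this.target = target;
        open();
    }

    private void open() throws IOException {
        Files.createDirectories(target.getParent());
        Files.createFile(target);   // stand-in for creating the HDFS file
        isOpen = true;
    }

    void append(String event) throws IOException {
        // Proposed fix: before appending, verify the file still exists.
        // If it was deleted or moved, mark the writer closed and open a
        // fresh target instead of failing forever.
        if (!isOpen || !Files.exists(target)) {
            isOpen = false;
            target = target.resolveSibling(target.getFileName() + ".reopened");
            open();
        }
        Files.write(target, (event + "\n").getBytes(), StandardOpenOption.APPEND);
    }

    Path currentTarget() { return target; }
}
```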

        Suhas Satish added a comment -

        I am hitting the same issue in a production environment.

        Here's the stack trace -
        10 Oct 2013 16:55:42,092 WARN [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.append:430) - Caught IOException while closing file (maprfs:///flume_import/2013/10/16//logdata-2013-10-16-50-03.1381449008596.tmp). Exception follows.
        java.io.IOException: 2049.112.5249612 /flume_import/2013/10/16/logdata-2013-10-16-50-03.1381449008596.tmp (Stale file handle)

        WARN [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.HDFSEventSink.process:418) - HDFS IO error
        java.io.IOException: 2049.112.5249612

        The solution would be to wrap another nested try/catch block in HDFSEventSink.java around bucketWriter.append(event);
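The suggested nested try/catch could take roughly this shape. This is a hedged, self-contained sketch, not Flume's actual HDFSEventSink code: the Writer interface, writers map, and factory parameter are illustrative stand-ins for BucketWriter and the sink's internal bookkeeping.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for BucketWriter.
interface Writer {
    void append(String event) throws IOException;
    void close() throws IOException;
}

class Sink {
    final Map<String, Writer> writers = new HashMap<>();

    void process(String path, String event, Function<String, Writer> factory)
            throws IOException {
        Writer w = writers.computeIfAbsent(path, factory);
        try {
            w.append(event);
        } catch (IOException e) {
            // Nested handling: drop the stale writer so the next event
            // opens a fresh file, instead of retrying the same dead handle.
            try { w.close(); } catch (IOException ignored) { }
            writers.remove(path);
            throw e;   // let the transaction roll back; the channel keeps the event
        }
    }
}
```

The key point is that the failed writer is discarded rather than left in an unrecoverable open-but-broken state, so a subsequent attempt creates a new file.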

        Suhas Satish added a comment -

        Steps to reproduce:

        1. Start the Flume agent as follows:
        bin/flume-ng agent -n agent -c conf -f conf/flume-agent.conf

        2. Generate log output using loggen (provided by syslog-ng):
        loggen -I 30 -s 300 -r 900 <hostname> 13073

        3. Verify flume output is generated

        4. Delete the flume output files while loggen is still running and Flume is generating the sink output.

        5. Check flume.log and you'll immediately see errors such as these:

        02 Oct 2013 12:22:26,515 ERROR [hdfs-MapRFS-call-runner-1] (com.mapr.fs.LoggerProxy.error:18) - Write failed for file: /flume_import/2013/10/12/logdata-2013-10-12-22-25.1380741745326.tmp, error: Stale file handle
        02 Oct 2013 12:22:26,520 WARN [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.HDFSEventSink.process:418) - HDFS IO error
        java.io.IOException: 2049.209.116198 /flume_import/2013/10/12/logdata-2013-10-12-22-25.1380741745326.tmp (Stale file handle)
        at com.mapr.fs.Inode.throwIfFailed(Inode.java:233)
        at com.mapr.fs.Inode.flushPages(Inode.java:347)
        at com.mapr.fs.Inode.syncInternal(Inode.java:439)
        at com.mapr.fs.Inode.syncUpto(Inode.java:448)
        at com.mapr.fs.MapRFsOutStream.sync(MapRFsOutStream.java:244)
        at com.mapr.fs.MapRFsDataOutputStream.sync(MapRFsDataOutputStream.java:68)
        at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:946)
        at org.apache.flume.sink.hdfs.HDFSSequenceFile.sync(HDFSSequenceFile.java:107)
        at org.apache.flume.sink.hdfs.BucketWriter$5.call(BucketWriter.java:356)
        at org.apache.flume.sink.hdfs.BucketWriter$5.call(BucketWriter.java:353)
        at org.apache.flume.sink.hdfs.BucketWriter$8$1.run(BucketWriter.java:536)
        at org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(BucketWriter.java:160)
        at org.apache.flume.sink.hdfs.BucketWriter.access$1000(BucketWriter.java:56)
        at org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:533)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
        at java.lang.Thread.run(Thread.java:662)
        02 Oct 2013 12:22:26,521 INFO [agent-shutdown-hook] (org.apache.flume.lifecycle.LifecycleSupervisor.stop:79) - Stopping lifecycle supervisor 10
        02 Oct 2013 12:22:26,523 INFO [agent-shutdown-hook] (org.apache.flume.channel.file.FileChannel.stop:328) - Stopping FileChannel fileChannel-1 { dataDirs: [/root/.flume/file-channel/data] }
        ...
        02 Oct 2013 12:22:26,523 INFO [agent-shutdown-hook] (org.apache.flume.channel.file.Log.shutdownWorker:836) - Attempting to shutdown background worker.
        02 Oct 2013 12:22:26,524 INFO [agent-shutdown-hook] (org.apache.flume.channel.file.LogFile$Writer.close:327) - Closing /root/.flume/file-channel/data/log-1
        02 Oct 2013 12:22:26,526 INFO [agent-shutdown-hook] (org.apache.flume.channel.file.LogFile$RandomReader.close:455) - Closing RandomReader /root/.flume/file-channel/data/log-1
        02 Oct 2013 12:22:26,532 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:139) - Component type: CHANNEL, name: fileChannel-1 stopped
        02 Oct 2013 12:22:26,532 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:145) - Shutdown Metric for type: CHANNEL, name: fileChannel-1. channel.start.time == 1380740120284
        02 Oct 2013 12:22:26,532 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:151) - Shutdown Metric for type: CHANNEL, name: fileChannel-1. channel.stop.time == 1380741746532
        02 Oct 2013 12:22:26,532 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:167) - Shutdown Metric for type: CHANNEL, name: fileChannel-1. channel.capacity == 100000
        02 Oct 2013 12:22:26,532 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:167) - Shutdown Metric for type: CHANNEL, name: fileChannel-1. channel.current.size == 527
        02 Oct 2013 12:22:26,532 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:167) - Shutdown Metric for type: CHANNEL, name: fileChannel-1. channel.event.put.attempt == 37634
        02 Oct 2013 12:22:26,533 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:167) - Shutdown Metric for type: CHANNEL, name: fileChannel-1. channel.event.put.success == 37634
        02 Oct 2013 12:22:26,533 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:167) - Shutdown Metric for type: CHANNEL, name: fileChannel-1. channel.event.take.attempt == 38302
        02 Oct 2013 12:22:26,533 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:167) - Shutdown Metric for type: CHANNEL, name: fileChannel-1. channel.event.take.success == 37107
        02 Oct 2013 12:22:26,533 INFO [agent-shutdown-hook] (org.apache.flume.sink.hdfs.HDFSEventSink.stop:437) - Closing maprfs:///flume_import/2013/10/11//logdata-2013-10-11-56-25
        02 Oct 2013 12:22:26,536 INFO [agent-shutdown-hook]
        Suhas Satish added a comment -

        Tested with the uploaded patch: the flume-agent continues to run and log successive loggen output even after a *.tmp file is deleted or its permissions are changed.


          People

          • Assignee: Unassigned
          • Reporter: Denny Ye
          • Votes: 3
          • Watchers: 6
