Flume / FLUME-3203

Spooling dir source leaks records from a file when a corresponding .COMPLETED file is already present


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: Sinks+Sources
    • Labels: None

    Description

      Here are the steps to reproduce:

      1) Use the config below in the Flume agent:

      tier1.sources  = source1
      tier1.channels = channel1
      tier1.sinks    = sink1
      
      tier1.channels.channel1.type   = memory
      tier1.channels.channel1.capacity = 1000
      tier1.channels.channel1.transactionCapacity = 1000
      
      tier1.sinks.sink1.channel      = channel1
      tier1.sources.source1.channels = channel1
      
      tier1.sources.source1.type     = spooldir
      tier1.sources.source1.spoolDir = /home/systest/spoolDir
      tier1.sources.source1.fileHeader = true
      
      tier1.sinks.sink1.type         = hdfs
      tier1.sinks.sink1.hdfs.path = /tmp/spoolEvnts
      tier1.sinks.sink1.hdfs.filePrefix = events-
      

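      For context: once the spooldir source has fully consumed a file, it marks it by renaming it with the completed suffix (".COMPLETED" by default). In essence (a simplified sketch, not the actual Flume source):

      import java.io.File;

      // Sketch of how a consumed spool file is marked: the source renames it
      // by appending the completed suffix, so re-ingesting a file of the same
      // name collides with this marker.
      final class CompletedMarkerSketch {
          public static void main(String[] args) {
              File consumed = new File("/home/systest/spoolDir/Sample-text-file-50kb.txt");
              File marker = new File(consumed.getPath() + ".COMPLETED");
              System.out.println("renamed: " + consumed.renameTo(marker));
          }
      }
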
      2) Once the agent is started with the above config, use the command below to move a sample text file into the spooling dir:

      mv Sample-text-file-50kb.txt /home/systest/spoolDir
      

      The agent will start processing the events, and the output can be seen in the HDFS dir:

      $ hdfs dfs -ls /tmp/spoolEvnts | uniq | wc -l
      37
      

      3) Move the same file into the spooling dir again using the command below:

      mv /tmp/Sample-text-file-50kb.txt /home/systest/spoolDir
      

      This time Flume raises the exception below, yet the file's events are still processed again:

      2017-12-21 00:00:27,581 INFO org.apache.flume.client.avro.ReliableSpoolingFileEventReader: Preparing to move file /home/systest/spoolDir/Sample-text-file-50kb.txt to /home/systest/spoolDir/Sample-text-file-50kb.txt.COMPLETED
      2017-12-21 00:00:27,582 ERROR org.apache.flume.source.SpoolDirectorySource: FATAL: Spool Directory source source1: { spoolDir: /home/systest/spoolDir }: Uncaught exception in SpoolDirectorySource thread. Restart or reconfigure Flume to continue processing.
      java.lang.IllegalStateException: File name has been re-used with different files. Spooling assumptions violated for /home/systest/spoolDir/Sample-text-file-50kb.txt.COMPLETED
      	at org.apache.flume.client.avro.ReliableSpoolingFileEventReader.rollCurrentFile(ReliableSpoolingFileEventReader.java:463)
      	at org.apache.flume.client.avro.ReliableSpoolingFileEventReader.retireCurrentFile(ReliableSpoolingFileEventReader.java:414)
      	at org.apache.flume.client.avro.ReliableSpoolingFileEventReader.readEvents(ReliableSpoolingFileEventReader.java:326)
      	at org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:250)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      2017-12-21 00:00:31,265 INFO org.apache.flume.sink.hdfs.BucketWriter: Closing /tmp/spoolEvnts/events-.1513843202836.tmp
      2017-12-21 00:00:31,275 INFO org.apache.flume.sink.hdfs.BucketWriter: Renaming /tmp/spoolEvnts/events-.1513843202836.tmp to /tmp/spoolEvnts/events-.1513843202836
      2017-12-21 00:00:31,293 INFO org.apache.flume.sink.hdfs.BucketWriter: Creating /tmp/spoolEvnts/events-.1513843202837.tmp
      2017-12-21 00:00:31,321 INFO org.apache.flume.sink.hdfs.BucketWriter: Closing /tmp/spoolEvnts/events-.1513843202837.tmp
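
      The stack trace shows the failure originates in rollCurrentFile, which runs only when a file is being retired, i.e. after readEvents has already handed the file's records to the channel. Reconstructed from the log above (a sketch of the apparent logic, not the verbatim Flume source):

      import java.io.File;
      import java.io.IOException;

      // Sketch (not the verbatim Flume source) of the retirement-time check
      // implied by the stack trace: the name clash is detected only here,
      // after readEvents() has already pushed the file's records downstream.
      final class RollSketch {
          static void rollCurrentFile(File fileToRoll, String completedSuffix)
                  throws IOException {
              File dest = new File(fileToRoll.getPath() + completedSuffix);
              if (dest.exists()) {
                  // Fires on the second ingest of Sample-text-file-50kb.txt,
                  // but the duplicate events are already in the channel.
                  throw new IllegalStateException(
                      "File name has been re-used with different files. "
                          + "Spooling assumptions violated for " + dest);
              }
              if (!fileToRoll.renameTo(dest)) {
                  throw new IOException("Unable to move " + fileToRoll + " to " + dest);
              }
          }
      }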
      

      And checking HDFS again shows the file count below:

      $ hdfs dfs -ls /tmp/spoolEvnts | uniq | wc -l
      72
      

      Per the documentation, the source should not process a file whose name matches an already-present .COMPLETED file. Re-processing it causes duplicate records on the sink.
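
      A guard at file-selection time would avoid the leak: skip (or quarantine) a candidate file whose .COMPLETED sibling already exists, before any of its events are read. A minimal, hypothetical sketch of such a check (illustrative only, not the project's actual fix):

      import java.io.File;

      // Hypothetical pre-read guard: reject a spooled file whose name collides
      // with an existing .COMPLETED marker, instead of discovering the clash
      // only at retirement time, when its events have already been shipped.
      final class SpoolFileGuard {
          private static final String COMPLETED_SUFFIX = ".COMPLETED";

          static boolean safeToIngest(File candidate) {
              return !new File(candidate.getPath() + COMPLETED_SUFFIX).exists();
          }

          public static void main(String[] args) {
              File f = new File("/home/systest/spoolDir/Sample-text-file-50kb.txt");
              System.out.println("safe to ingest: " + safeToIngest(f));
          }
      }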


People

    Assignee: Unassigned
    Reporter: Umesh Chaudhary
    Votes: 0
    Watchers: 3
