FLUME-3221: using spooling dir and HDFS sink (hdfs.codeC = lzop), data loss found when hdfs.filePrefix = %{basename} is set


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.8.0
    • Fix Version/s: notrack
    • Component/s: Sinks+Sources
    • Labels: None
    • Environment:

      Java version "1.8.0_151"
      Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
      Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
      Hadoop 2.6.3
      lzop native lib: hadoop-lzo-0.4.20-SNAPSHOT.jar

    Description

A Flume agent configured with the following parameters causes this problem:

      Configuration

      spool_flume1.sources = spool-source-spool
      spool_flume1.channels = hdfs-channel-spool
      spool_flume1.sinks = hdfs-sink-spool

      spool_flume1.sources.spool-source-spool.type = spooldir
      spool_flume1.sources.spool-source-spool.channels = hdfs-channel-spool
      spool_flume1.sources.spool-source-spool.spoolDir = /home/test/flume_log
      spool_flume1.sources.spool-source-spool.recursiveDirectorySearch = true
      spool_flume1.sources.spool-source-spool.fileHeader = true
      spool_flume1.sources.spool-source-spool.deserializer = LINE
      spool_flume1.sources.spool-source-spool.deserializer.maxLineLength = 100000000
      spool_flume1.sources.spool-source-spool.inputCharset = UTF-8
      spool_flume1.sources.spool-source-spool.basenameHeader = true
      spool_flume1.sources.spool-source-spool.includePattern = log.*-1_2018.*$
      spool_flume1.sources.spool-source-spool.batchSize = 100

      spool_flume1.channels.hdfs-channel-spool.type = memory
      spool_flume1.channels.hdfs-channel-spool.keep-alive = 60
      spool_flume1.channels.hdfs-channel-spool.capacity = 1000
      spool_flume1.channels.hdfs-channel-spool.transactionCapacity = 100

      spool_flume1.sinks.hdfs-sink-spool.channel = hdfs-channel-spool
      spool_flume1.sinks.hdfs-sink-spool.type = hdfs
      spool_flume1.sinks.hdfs-sink-spool.hdfs.writeFormat = Text
      spool_flume1.sinks.hdfs-sink-spool.hdfs.fileType = CompressedStream
      spool_flume1.sinks.hdfs-sink-spool.hdfs.codeC = lzop
      spool_flume1.sinks.hdfs-sink-spool.hdfs.threadsPoolSize = 1
      spool_flume1.sinks.hdfs-sink-spool.hdfs.callTimeout = 100000
      spool_flume1.sinks.hdfs-sink-spool.hdfs.idleTimeout = 36
      spool_flume1.sinks.hdfs-sink-spool.hdfs.useLocalTimeStamp = true
      spool_flume1.sinks.hdfs-sink-spool.hdfs.filePrefix = %{basename}
      spool_flume1.sinks.hdfs-sink-spool.hdfs.path = /user/test/flume_test
      spool_flume1.sinks.hdfs-sink-spool.hdfs.rollCount = 0
      spool_flume1.sinks.hdfs-sink-spool.hdfs.rollSize = 134217728
      spool_flume1.sinks.hdfs-sink-spool.hdfs.rollInterval = 0
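
      For context on the naming: with basenameHeader = true, the spooling directory source stores each input file's basename in the event headers, and the HDFS sink expands the %{basename} token in hdfs.filePrefix against those headers, so every distinct basename opens its own bucket writer and output file. A minimal sketch of that substitution, assuming Flume's BucketPath.escapeString(String, Map) helper and a hypothetical file name:

      import java.util.HashMap;
      import java.util.Map;

      import org.apache.flume.formatter.output.BucketPath;

      public class BasenameEscapeDemo {
          public static void main(String[] args) {
              // The spooling directory source (basenameHeader = true) sets this
              // header to the source file's basename for every event it reads.
              Map<String, String> headers = new HashMap<>();
              headers.put("basename", "log-app-1_20180101.log"); // hypothetical name

              // The HDFS sink expands %{...} tokens in hdfs.filePrefix against
              // the event headers, so each distinct basename yields its own
              // file prefix (and its own compressed output stream).
              String prefix = BucketPath.escapeString("%{basename}", headers);
              System.out.println(prefix); // prints log-app-1_20180101.log
          }
      }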
      

The test data adds up to 4.2 GB, amounting to 5,271,962 lines.

       

Expected behavior: data is stored in lzop format on HDFS, with files named %{basename}_%{LocalTimeStamp}.

However, in my tests the sink data was mixed across different files, and the total amount of data uploaded was less than the local data.

Test cases are listed below (a verification sketch follows the list):

• Using DataStream: uploads normally whether or not filePrefix = %{basename} is set.
• Using CompressedStream with hdfs.codeC = lzop:
  • filePrefix left at the default: uploads normally.
  • filePrefix = %{basename}: data is mixed across files and lost.
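
      To quantify the loss, the decompressed line count on HDFS can be compared against the 5,271,962 local lines. A rough sketch of such a check, assuming the hadoop-lzo codec is registered in io.compression.codecs so CompressionCodecFactory can resolve the .lzo extension (the class name here is hypothetical):

      import java.io.BufferedReader;
      import java.io.InputStreamReader;
      import java.nio.charset.StandardCharsets;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.CompressionCodecFactory;

      public class HdfsLineCount {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              CompressionCodecFactory codecs = new CompressionCodecFactory(conf);

              long total = 0;
              // Same directory as hdfs.path in the configuration above.
              for (FileStatus status : fs.listStatus(new Path("/user/test/flume_test"))) {
                  Path file = status.getPath();
                  // Resolves the codec from the file extension (.lzo -> LzopCodec).
                  CompressionCodec codec = codecs.getCodec(file);
                  try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                          codec == null ? fs.open(file) : codec.createInputStream(fs.open(file)),
                          StandardCharsets.UTF_8))) {
                      while (reader.readLine() != null) {
                          total++;
                      }
                  }
              }
              // Local data holds 5271962 lines; any smaller total here is loss.
              System.out.println("decompressed lines on HDFS: " + total);
          }
      }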

When the Flume agent process is shut down, it is odd that flume.log prints the correct event counts, yet the actual amount of uploaded data is smaller. The log file is attached at the end.

      Attachments

  1. flume_shutdown.log (7 kB, Nicho Zhou)


People

  Assignee: Unassigned
  Reporter: Nicho Zhou (nicho92)
