Flume / FLUME-2241

Spooling Directory Source doesn't handle 2 byte UTF-8 encoded characters correctly

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: Debian 6.0.5

    Description

      I have a Flume agent set up with a spooling directory source, sinking data to Cassandra.

      I'm collecting web data, writing one line to the log file for each request. Once the log file has been rotated, it is dropped into the spooling directory, ready for Flume to start processing it. All data is valid JSON, as it's validated before being written to the log file.

      Sending a mixture of different-sized requests, from 9k to 15k, seems fine. I generated a log file of over 400 MB and all of it reached the sink correctly.

      I'm currently logging a 19k request, and this is when things start to break. It only gets as far as the 1800th request in the file, and the next one is truncated.

      I changed the sink to a file-roll sink, and it only gets as far as 29 MB.
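
One way to pin down exactly where the truncation starts is to scan the file-roll sink output for the first line that no longer parses as JSON. The following is a minimal Python sketch, assuming the sink writes one event per line; the function name `first_invalid_json_line` is hypothetical and not part of Flume.

```python
import json
from typing import Iterable, Optional, Tuple


def first_invalid_json_line(lines: Iterable[str]) -> Optional[Tuple[int, int]]:
    """Return (line_number, length_in_chars) of the first line that is not valid JSON.

    Line numbers are 1-based. Returns None when every non-empty line parses.
    """
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text:
            continue  # ignore blank lines
        try:
            json.loads(text)
        except json.JSONDecodeError:
            # A truncated record usually fails here with an unterminated string or object.
            return number, len(text)
    return None


# Example use against a sink output file (the path is a placeholder):
#     with open("/path/to/file-roll-output", encoding="utf-8", errors="replace") as handle:
#         print(first_invalid_json_line(handle))
```

Comparing the reported line number and length against the same line in the original log file shows whether the event was cut short or split across events.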

      I have profiled it and it's not running out of memory. I want to know if there are any limitations on the spooling directory source.

      Has anyone tried dropping in a file with similarly large requests and experienced a similar issue?

      Any pointers would be greatly appreciated. My Flume config is as follows:

      flume_conf
      orion.sources = spoolDir
      orion.channels = fileChannel
      orion.sinks = cassandra
      
      orion.channels.fileChannel.type = file
      orion.channels.fileChannel.capacity = 1000000
      orion.channels.fileChannel.transactionCapacity = 100
      orion.channels.fileChannel.keep-alive = 60
      orion.channels.fileChannel.write-timeout = 60
      
      orion.sinks.cassandra.type = com.btoddb.flume.sinks.cassandra.CassandraSink
      orion.sinks.cassandra.hosts = <cluster node ip>
      orion.sinks.cassandra.cluster_name = fake_cluster
      orion.sinks.cassandra.port = 9160
      orion.sinks.cassandra.keyspace-name = Keysp
      orion.sinks.cassandra.records-colfam = <table>
      
      orion.sources.spoolDir.type = spooldir
      orion.sources.spoolDir.spoolDir = /var/log/orion/flumeSpooling
      orion.sources.spoolDir.deserializer = LINE
      orion.sources.spoolDir.inputCharset = UTF-8
      orion.sources.spoolDir.deserializer.maxLineLength = 20000000
      orion.sources.spoolDir.deletePolicy = never
      orion.sources.spoolDir.batchSize = 100
      orion.sources.spoolDir.interceptors = addSrc addHost addTimestamp addUUID
      
      orion.sources.spoolDir.interceptors.addSrc.type = regex_extractor
      orion.sources.spoolDir.interceptors.addSrc.regex = \"service\"\:\"([^"]*)
      orion.sources.spoolDir.interceptors.addSrc.serializers = s1
      orion.sources.spoolDir.interceptors.addSrc.serializers.s1.name = src
      
      orion.sources.spoolDir.interceptors.addUUID.type = regex_extractor
      orion.sources.spoolDir.interceptors.addUUID.regex = \"uuid\"\:\"([^"]*)
      orion.sources.spoolDir.interceptors.addUUID.serializers = s1
      orion.sources.spoolDir.interceptors.addUUID.serializers.s1.name = key
      
      orion.sources.spoolDir.interceptors.addHost.type = org.apache.flume.interceptor.HostInterceptor$Builder
      orion.sources.spoolDir.interceptors.addHost.preserveExisting = false
      orion.sources.spoolDir.interceptors.addHost.useIP = true
      orion.sources.spoolDir.interceptors.addHost.hostHeader = host
      
      orion.sources.spoolDir.interceptors.addTimestamp.type = regex_extractor
      orion.sources.spoolDir.interceptors.addTimestamp.regex = \"timestamp\"\:\"([^"]*)
      orion.sources.spoolDir.interceptors.addTimestamp.serializers = s1
      orion.sources.spoolDir.interceptors.addTimestamp.serializers.s1.name = timestamp
      
      orion.sources.spoolDir.channels = fileChannel
      orion.sinks.cassandra.channel = fileChannel
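
The three regex_extractor interceptors in this config can be approximated outside Flume to check which headers a given line would receive. The sketch below is in Python and rests on two assumptions: that the config is read with java.util.Properties semantics, so `\"` and `\:` unescape to `"` and `:`, and that the interceptor searches the event body for the first match. The sample body and the helper name `extract_headers` are made up for illustration.

```python
import re

# Effective patterns after .properties unescaping (\" -> ", \: -> :),
# keyed by the header name each serializer writes to.
PATTERNS = {
    "src": re.compile(r'"service":"([^"]*)'),
    "key": re.compile(r'"uuid":"([^"]*)'),
    "timestamp": re.compile(r'"timestamp":"([^"]*)'),
}


def extract_headers(body: str) -> dict:
    """Return the headers the three extractor patterns would produce for one line."""
    headers = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(body)
        if match:
            headers[name] = match.group(1)
    return headers


# Hypothetical sample line in compact JSON form.
sample = '{"uuid":"abc-123","service":"web","timestamp":"1384300800"}'
print(extract_headers(sample))
```

These patterns only match compact JSON with string values: a space after the colon, or a numeric timestamp, yields no header. A line truncated before its `uuid` field would likewise reach the sink without a `key` header.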
      

      Is this potentially a bug? If nobody has tried this yet, could someone try to recreate it? I would expect the same error to occur.
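
For anyone attempting to recreate this, a test file can be generated rather than captured from live traffic. The sketch below is a Python example under stated assumptions: the field names mirror the ones the interceptors above look for, the payload is padded with "é" because the issue title points at 2-byte UTF-8 characters, and the record count and size are taken from the figures in this report. The function names are hypothetical.

```python
import json
import uuid


def make_record(target_bytes: int, index: int) -> str:
    """Build one compact JSON line whose UTF-8 encoding is at least target_bytes long."""
    record = {
        "uuid": str(uuid.uuid4()),
        "service": "repro",
        "timestamp": str(index),
        "payload": "",
    }
    # Size of the record in bytes before any padding is added.
    base = len(json.dumps(record, separators=(",", ":"), ensure_ascii=False).encode("utf-8"))
    # "é" (U+00E9) occupies two bytes in UTF-8, so round the padding up to whole characters.
    record["payload"] = "é" * max(0, (target_bytes - base + 1) // 2)
    return json.dumps(record, separators=(",", ":"), ensure_ascii=False)


def write_spool_file(path: str, count: int = 2000, target_bytes: int = 19000) -> None:
    """Write count records, one per line, as UTF-8 with Unix line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for index in range(count):
            handle.write(make_record(target_bytes, index) + "\n")
```

The file should be written somewhere else first and moved into the spooling directory only once complete, since the spooling directory source expects files to be immutable after they are placed there.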

      Don't hesitate to contact me for further info.

      Viktor

            People

              Assignee: Unassigned
              Reporter: Viktor Trako (viktort)