  1. Flume
  2. FLUME-2364

netcat source and HDFS sink. Performance problem


Details

    • Type: Test
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Configuration
    • Labels: None

    Description

      1. We have a CSV file, size ~1 GB.
      2. We stored it to HDFS using hadoop fs -put. It took ~10 seconds.
      3. We then tried Flume 1.2 with a netcat source and an HDFS sink and hit a serious performance problem: it takes ~20 minutes to store the file. Also, the HDFS sink doesn't store it as a single file. It creates a lot of files, each ~2 MB in size.
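
To put the two timings side by side, here is a rough back-of-the-envelope calculation. It assumes the file is exactly 1 GiB and every output file exactly 2 MiB (the report only gives approximate sizes); the function name is made up for illustration.

```python
MIB = 1024 * 1024


def throughput_mib_per_s(size_bytes, seconds):
    """Average throughput in MiB/s for a transfer of size_bytes taking seconds."""
    return size_bytes / MIB / seconds


# Assumed round figures; the report only says "~1GB", "~10 seconds",
# "~20 minutes" and "~2 MB".
FILE_SIZE = 1024 * MIB
PUT_SECONDS = 10
FLUME_SECONDS = 20 * 60
OUTPUT_FILE_SIZE = 2 * MIB

put_rate = throughput_mib_per_s(FILE_SIZE, PUT_SECONDS)
flume_rate = throughput_mib_per_s(FILE_SIZE, FLUME_SECONDS)
slowdown = FLUME_SECONDS / PUT_SECONDS
file_count = FILE_SIZE // OUTPUT_FILE_SIZE

print(f"hadoop fs -put: {put_rate:.1f} MiB/s")
print(f"Flume pipeline: {flume_rate:.2f} MiB/s")
print(f"slowdown:       {slowdown:.0f}x")
print(f"output files:   {file_count}")
```

Under these assumptions the Flume path runs at under 1 MiB/s, roughly two orders of magnitude slower than the direct put, and leaves several hundred small files in HDFS.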

      Our goal is:
      1. Send CSV files to HDFS: we send the file a1.csv to Flume and get a1.csv in HDFS.
      2. We send these files one by one.
      3. We want the HDFS sink to close the file once it has been fully received.
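
For reference, a minimal sketch of how one file could be pushed to a netcat source from a client, assuming a plain TCP connection and newline-terminated CSV rows (the netcat source turns each line into one event). `send_file_lines` is a hypothetical helper, not part of Flume, and it assumes the source is configured with ack-every-event = false, since it never reads replies from the socket.

```python
import socket


def send_file_lines(path, host, port, chunk_size=64 * 1024):
    """Stream a newline-delimited file to a TCP listener.

    Returns the number of bytes sent. The connection is closed when the
    whole file has been written, which is the only end-of-file signal
    the receiving side gets.
    """
    sent = 0
    with socket.create_connection((host, port)) as sock, open(path, "rb") as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            sock.sendall(chunk)  # blocks until the whole chunk is handed to the kernel
            sent += len(chunk)
    return sent
```

A call for this setup might look like `send_file_lines("a1.csv", "10.66.48.23", 6969)`, with the host and port taken from the configuration in this issue. Note that nothing in this stream carries the file name, so the sink cannot know that the data came from a1.csv.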

      Here is our configuration:

      httpptpt.sources = httpptpt_src
      httpptpt.channels = httpptpt_channel
      httpptpt.sinks = httpptpt_sink

      # sources
        httpptpt.sources.httpptpt_src.type = netcat
        httpptpt.sources.httpptpt_src.bind = 10.66.48.23
        httpptpt.sources.httpptpt_src.port = 6969
        httpptpt.sources.httpptpt_src.ack-every-event = false
        #default size is 512B
        #httpptpt.sources.httpptpt_src.max-line-length = 4096
        httpptpt.sources.httpptpt_src.channels = httpptpt_channel
      # channel
        httpptpt.channels.httpptpt_channel.type = memory
        #It seems we don't understand how this works. With the default values (capacity=100, transactionCapacity=100) it doesn't work: the memory channel has no room to store incoming lines
        #httpptpt.channels.httpptpt_channel.capacity = 100000
        #httpptpt.channels.httpptpt_channel.transactionCapacity = 1000
        #Default is 3 sec
        #httpptpt.channels.httpptpt_channel.keep-alive = 1
      # sink
        httpptpt.sinks.httpptpt_sink.channel = httpptpt_channel
        httpptpt.sinks.httpptpt_sink.type = hdfs
        httpptpt.sinks.httpptpt_sink.hdfs.path = hdfs://10.66.48.23/user/httpptpt/
        httpptpt.sinks.httpptpt_sink.hdfs.fileType = DataStream
        httpptpt.sinks.httpptpt_sink.hdfs.writeFormat = Writable
        httpptpt.sinks.httpptpt_sink.hdfs.filePrefix = httpptpt
        httpptpt.sinks.httpptpt_sink.hdfs.threadsPoolSize = 10
        #We want the HDFS sink to roll the temp file after the source stops emitting lines
        #httpptpt.sinks.httpptpt_sink.hdfs.rollSize = 10485760000
        httpptpt.sinks.httpptpt_sink.hdfs.rollSize = 0
        #httpptpt.sinks.httpptpt_sink.hdfs.rollCount = 6000000
        httpptpt.sinks.httpptpt_sink.hdfs.rollCount = 0
        httpptpt.sinks.httpptpt_sink.hdfs.rollInterval = 0
        #??? If the source emits no messages for 10 seconds, roll the file
        httpptpt.sinks.httpptpt_sink.hdfs.idleTimeout = 10
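
As a reading aid for the sink settings above, here is a small sketch that lists which file-closing triggers are actually switched on. It relies on the Flume User Guide convention that a value of 0 disables rollInterval, rollSize, rollCount and idleTimeout; the helper names are made up for illustration, and only explicitly set, uncommented properties are considered.

```python
# Sink roll settings copied from the configuration in this issue.
CONFIG = """
#httpptpt.sinks.httpptpt_sink.hdfs.rollSize = 10485760000
httpptpt.sinks.httpptpt_sink.hdfs.rollSize = 0
#httpptpt.sinks.httpptpt_sink.hdfs.rollCount = 6000000
httpptpt.sinks.httpptpt_sink.hdfs.rollCount = 0
httpptpt.sinks.httpptpt_sink.hdfs.rollInterval = 0
httpptpt.sinks.httpptpt_sink.hdfs.idleTimeout = 10
"""


def parse_properties(text):
    """Parse key = value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def active_close_triggers(props, sink_prefix):
    """Return the explicitly set roll/idle settings whose value is non-zero."""
    active = {}
    for name in ("rollInterval", "rollSize", "rollCount", "idleTimeout"):
        value = props.get(f"{sink_prefix}.hdfs.{name}")
        if value is not None and int(value) != 0:
            active[name] = int(value)
    return active


triggers = active_close_triggers(parse_properties(CONFIG), "httpptpt.sinks.httpptpt_sink")
print(triggers)  # {'idleTimeout': 10}
```

With the posted values, the only active trigger is idleTimeout, so a file is closed whenever the sink sees no writes for 10 seconds.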

      What are we doing wrong?

          People

            Assignee: Unassigned
            Reporter: Praveen (prawin530)
            Votes: 0
            Watchers: 2

              Time Tracking

                Original Estimate: 24h
                Remaining Estimate: 24h
                Time Spent: Not Specified