FLUME-2364

netcat source and HDFS sink. Performance problem

    Details

    • Type: Test
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Configuration
    • Labels: None

      Description

      1. We have a CSV file of about 1 GB.
      2. We stored it in HDFS using hadoop fs -put. It took ~10 seconds.
      3. We then tried Flume 1.2 with a netcat source and an HDFS sink and hit a serious performance problem: it takes ~20 minutes to store the file. Also, the HDFS sink doesn't store it as a single file. It creates a lot of files, each ~2 MB in size.
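
      As a rough back-of-the-envelope check of the figures above, the sketch below converts them into average throughput and an approximate output file count. It assumes "~1 GB" means 1024 MB and that all output files are about the same size; both are assumptions, since the report only gives approximate numbers.

```python
# Figures quoted in the report; all are approximate.
FILE_SIZE_MB = 1024          # "~1 GB" CSV file, assumed to be 1024 MB
PUT_SECONDS = 10             # hadoop fs -put took ~10 seconds
FLUME_SECONDS = 20 * 60      # the Flume pipeline took ~20 minutes
OUTPUT_FILE_MB = 2           # each HDFS file produced was ~2 MB


def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Average throughput in MB/s."""
    return size_mb / seconds


put_rate = throughput_mb_per_s(FILE_SIZE_MB, PUT_SECONDS)
flume_rate = throughput_mb_per_s(FILE_SIZE_MB, FLUME_SECONDS)
slowdown = FLUME_SECONDS / PUT_SECONDS          # how many times slower Flume was
file_count = FILE_SIZE_MB // OUTPUT_FILE_MB     # rough number of HDFS files

print(f"hadoop fs -put: {put_rate:.1f} MB/s")
print(f"Flume pipeline: {flume_rate:.2f} MB/s")
print(f"slowdown:       {slowdown:.0f}x")
print(f"output files:   ~{file_count}")
```

      Under these assumptions the Flume path averages under 1 MB/s, against roughly 100 MB/s for the direct copy, and the 1 GB input ends up spread over several hundred small files.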

      Our goal is:
      1. Send CSV files to HDFS: we send file a1.csv to Flume and get a1.csv in HDFS.
      2. We send these files one by one.
      3. We want the HDFS sink to close the file after it has been received.
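
      For illustration, sending one file to a line-oriented TCP listener such as the netcat source could look like the sketch below. The helper name send_file_lines and the chunk size are hypothetical and not part of the report. The sketch only streams the bytes; it does not read any replies, which matches a source configured with ack-every-event = false, and it does nothing to carry the file name to the receiver.

```python
import socket


def send_file_lines(path: str, host: str, port: int, chunk_size: int = 64 * 1024) -> int:
    """Stream a file to a TCP listener and return the number of bytes sent.

    The file is sent unmodified, so each newline-terminated line reaches
    the listener as one line of text. (Hypothetical helper for illustration.)
    """
    sent = 0
    with socket.create_connection((host, port)) as conn, open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break                 # end of file: leaving the block closes the socket
            conn.sendall(chunk)       # sendall blocks until the whole chunk is written
            sent += len(chunk)
    return sent
```

      A call such as send_file_lines("a1.csv", "10.66.48.23", 6969) would send the file to the bind address and port used in the configuration below, one file per connection.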

      Here is our configuration:

      httpptpt.sources = httpptpt_src
      httpptpt.channels = httpptpt_channel
      httpptpt.sinks = httpptpt_sink

      # sources
      httpptpt.sources.httpptpt_src.type = netcat
      httpptpt.sources.httpptpt_src.bind = 10.66.48.23
      httpptpt.sources.httpptpt_src.port = 6969
      httpptpt.sources.httpptpt_src.ack-every-event = false
      # default max-line-length is 512 bytes
      #httpptpt.sources.httpptpt_src.max-line-length = 4096
      httpptpt.sources.httpptpt_src.channels = httpptpt_channel

      # channel
      httpptpt.channels.httpptpt_channel.type = memory
      # It seems we don't understand how this works. With the default values
      # (capacity = 100, transaction capacity = 100) it doesn't work:
      # the memory channel has no room to store incoming lines.
      #httpptpt.channels.httpptpt_channel.capacity = 100000
      #httpptpt.channels.httpptpt_channel.transactionCapacity = 1000
      # Default is 3 sec
      #httpptpt.channels.httpptpt_channel.keep-alive = 1

      # sink
      httpptpt.sinks.httpptpt_sink.channel = httpptpt_channel
      httpptpt.sinks.httpptpt_sink.type = hdfs
      httpptpt.sinks.httpptpt_sink.hdfs.path = hdfs://10.66.48.23/user/httpptpt/
      httpptpt.sinks.httpptpt_sink.hdfs.fileType = DataStream
      httpptpt.sinks.httpptpt_sink.hdfs.writeFormat = Writable
      httpptpt.sinks.httpptpt_sink.hdfs.filePrefix = httpptpt
      httpptpt.sinks.httpptpt_sink.hdfs.threadsPoolSize = 10
      # We want the HDFS sink to roll the temp file after the source stops emitting lines
      #httpptpt.sinks.httpptpt_sink.hdfs.rollSize = 10485760000
      httpptpt.sinks.httpptpt_sink.hdfs.rollSize = 0
      #httpptpt.sinks.httpptpt_sink.hdfs.rollCount = 6000000
      httpptpt.sinks.httpptpt_sink.hdfs.rollCount = 0
      httpptpt.sinks.httpptpt_sink.hdfs.rollInterval = 0
      # ??? If the source doesn't emit messages for 10 seconds, then roll the file
      httpptpt.sinks.httpptpt_sink.hdfs.idleTimeout = 10

      What are we doing wrong?

        Activity

        Ashish Paliwal added a comment -

        User ML question

        Ashish Paliwal added a comment -

        Please ask this question on the user mailing list. More info at http://flume.apache.org/mailinglists.html


          People

          • Assignee: Unassigned
          • Reporter: Praveen
          • Votes: 0
          • Watchers: 2

              Time Tracking

              • Original Estimate: 24h
              • Remaining Estimate: 24h
              • Time Spent: Not Specified