Details
Test
Status: Resolved
Major
Resolution: Invalid
Description
1. We have a CSV file, about 1 GB in size.
2. We stored it in HDFS using hadoop fs -put. It took ~10 seconds.
3. We then tried Flume 1.2 with a netcat source and an HDFS sink and hit a serious performance problem: it takes ~20 minutes to store the file. Also, the HDFS sink doesn't store it as a single file; it creates a lot of files of ~2 MB each.
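To put the numbers from the steps above side by side, here is a rough back-of-the-envelope calculation. It assumes the "~1 GB" file is exactly 1024 MB and takes the reported times and file sizes at face value, so the results are only approximate.

```python
def throughput_mb_per_s(size_mb, seconds):
    """Average throughput in MB/s for a transfer of size_mb taking `seconds`."""
    return size_mb / seconds

# Figures from the report; the exact file size is an assumption (~1 GB -> 1024 MB).
FILE_SIZE_MB = 1024
PUT_SECONDS = 10            # hadoop fs -put
FLUME_SECONDS = 20 * 60     # Flume netcat source -> HDFS sink
OUTPUT_FILE_MB = 2          # approximate size of each file the sink produced

put_rate = throughput_mb_per_s(FILE_SIZE_MB, PUT_SECONDS)
flume_rate = throughput_mb_per_s(FILE_SIZE_MB, FLUME_SECONDS)
slowdown = FLUME_SECONDS / PUT_SECONDS
file_count = FILE_SIZE_MB // OUTPUT_FILE_MB

print(f"hadoop fs -put: {put_rate:.1f} MB/s")
print(f"Flume:          {flume_rate:.2f} MB/s")
print(f"slowdown:       {slowdown:.0f}x")
print(f"output files:   ~{file_count}")
```

Under these assumptions the Flume path is roughly two orders of magnitude slower than the direct put, and the sink produces on the order of several hundred files instead of one.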
Our goal is:
1. Send CSV files to HDFS: we send file a1.csv to Flume and get a1.csv in HDFS.
2. We send these files one by one.
3. We want the HDFS sink to close the file after it has been received.
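For illustration, sending one file to the netcat source could look like the sketch below. This is not necessarily how the files are sent in this setup; the helper name `send_file` and the chunk size are hypothetical, and only the host and port in the usage comment come from the configuration in this report.

```python
import socket

def send_file(path, host, port, chunk_size=64 * 1024):
    """Stream the raw bytes of `path` to host:port over TCP.

    Returns the number of bytes sent. The connection is closed when the
    file has been fully written.
    """
    sent = 0
    with socket.create_connection((host, port)) as conn, open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            conn.sendall(chunk)
            sent += len(chunk)
    return sent

# Hypothetical usage with the bind address and port of the netcat source:
# send_file("a1.csv", "10.66.48.23", 6969)
```

Note that a plain TCP stream like this carries only the file contents; the name a1.csv is not part of what is sent.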
Here is our configuration:
httpptpt.sources = httpptpt_src
httpptpt.channels = httpptpt_channel
httpptpt.sinks = httpptpt_sink
- sources
httpptpt.sources.httpptpt_src.type = netcat
httpptpt.sources.httpptpt_src.bind = 10.66.48.23
httpptpt.sources.httpptpt_src.port = 6969
httpptpt.sources.httpptpt_src.ack-every-event = false
#default size is 512B
#httpptpt.sources.httpptpt_src.max-line-length = 4096
httpptpt.sources.httpptpt_src.channels = httpptpt_channel
- channel
httpptpt.channels.httpptpt_channel.type = memory
#It seems we don't understand how this works. With the default values (capacity=100, transaction capacity=100) it doesn't work: the memory channel has no room to store incoming lines
#httpptpt.channels.httpptpt_channel.capacity = 100000
#httpptpt.channels.httpptpt_channel.transactionCapacity = 1000
#Default is 3 sec
#httpptpt.channels.httpptpt_channel.keep-alive = 1
- sink
httpptpt.sinks.httpptpt_sink.channel = httpptpt_channel
httpptpt.sinks.httpptpt_sink.type = hdfs
httpptpt.sinks.httpptpt_sink.hdfs.path = hdfs://10.66.48.23/user/httpptpt/
httpptpt.sinks.httpptpt_sink.hdfs.fileType = DataStream
httpptpt.sinks.httpptpt_sink.hdfs.writeFormat = Writable
httpptpt.sinks.httpptpt_sink.hdfs.filePrefix = httpptpt
httpptpt.sinks.httpptpt_sink.hdfs.threadsPoolSize = 10
#We want the HDFS sink to roll the temp file after the source stops emitting lines
#httpptpt.sinks.httpptpt_sink.hdfs.rollSize = 10485760000
httpptpt.sinks.httpptpt_sink.hdfs.rollSize = 0
#httpptpt.sinks.httpptpt_sink.hdfs.rollCount = 6000000
httpptpt.sinks.httpptpt_sink.hdfs.rollCount = 0
httpptpt.sinks.httpptpt_sink.hdfs.rollInterval = 0
#??? If the source doesn't emit messages for 10 seconds, then roll the file
httpptpt.sinks.httpptpt_sink.hdfs.idleTimeout = 10
What are we doing wrong?