I'm using the Flume elasticsearch-sink to transfer logs from EC2 instances to Elasticsearch, and many documents end up indexed more than once.
I noticed this while sending a known number of log lines to Elasticsearch through Flume and then counting them in Kibana to verify that all of them had arrived. Most of the time, especially when multiple Flume instances were running, I got duplicate entries. For example, instead of receiving 10000 documents from an instance, I received 10060.
The amount of duplication seems to grow with the number of instances sending log data simultaneously. For example, with 3 Flume instances I get 10060 documents, and with 50 Flume instances I get 10300.
Is duplication something I should expect when using the Flume elasticsearch-sink?
There is a doRollback() method that is called on transaction failure, but I think it only updates the local Flume channel and does not undo anything that has already been written to Elasticsearch.
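To make my suspicion concrete, here is a minimal Python simulation of what I think may be happening. It is not Flume or Elasticsearch code, and I have not verified that the sink behaves this way. FakeIndex, deliver, fail_on_attempts and id_for are all hypothetical names I made up for the sketch. The assumption is that a batch can reach the index while the acknowledgement is lost, so the transaction is rolled back, the batch stays in the channel, and it is sent again with freshly generated document ids.

```python
import hashlib


class FakeIndex:
    """Hypothetical stand-in for the document store; not a real client."""

    def __init__(self):
        self.docs = {}
        self._next_id = 0

    def bulk_index(self, batch, id_for=None):
        for body in batch:
            if id_for is None:
                # Assumption: with no id supplied, every write gets a new id,
                # so a re-sent event is stored as a separate document.
                doc_id = f"auto-{self._next_id}"
                self._next_id += 1
            else:
                # A content-derived id makes a re-sent event overwrite itself.
                doc_id = id_for(body)
            self.docs[doc_id] = body


def deliver(events, index, batch_size, fail_on_attempts, id_for=None):
    """Drain events in batches with take / commit / rollback semantics.

    fail_on_attempts holds the attempt numbers on which the write reaches
    the index but the sender sees a failure and rolls back.
    Returns the total number of attempts made.
    """
    channel = list(events)
    attempt = 0
    while channel:
        attempt += 1
        batch = channel[:batch_size]      # take, but do not remove yet
        index.bulk_index(batch, id_for)   # the write itself succeeds
        if attempt in fail_on_attempts:
            continue                      # rollback: batch stays in the channel
        del channel[:batch_size]          # commit: batch leaves the channel
    return attempt


def content_id(body):
    """Derive a stable id from the event body (only valid for unique lines)."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


events = [f"log line {i}" for i in range(10000)]

auto_ids = FakeIndex()
deliver(events, auto_ids, batch_size=100, fail_on_attempts={3, 7})
print(len(auto_ids.docs))    # 10200: two rolled-back batches were stored twice

stable_ids = FakeIndex()
deliver(events, stable_ids, batch_size=100, fail_on_attempts={3, 7}, id_for=content_id)
print(len(stable_ids.docs))  # 10000: the retries overwrote the same documents
```

If this is roughly what the sink does, then rollback alone cannot remove the extra documents, and more instances sending at once would mean more failed or timed-out batches and therefore more duplicates. The second run only shows that stable ids would hide the retries in this toy model; I don't know whether the sink lets me set document ids.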
Any info/suggestions would be appreciated.