The problem occurs when using escapedCustomDfs, and it doesn't appear to be a flushing issue: events are missing all the way across the data source, some from the very beginning of the file, others mid-way through and towards the end. The loss is non-deterministic at the moment, so we can't devise a workaround or a reliable means of verifying the integrity of the file.
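For reference, this is roughly the shape of the flow we're running; the node names, port, and HDFS path below are placeholders rather than our real config:

```
# agent tails the rotating log directory and forwards to a collector
agent1     : tailDir("/var/log/app/") | agentBESink("collector1", 35853) ;
# collector writes the events to HDFS via escapedCustomDfs
collector1 : collectorSource(35853) | escapedCustomDfs("hdfs://namenode/flume/%Y-%m-%d", "events-") ;
```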
When we look at the Flume stats, the number of events sent matches the number in the source file, but the version in HDFS is missing a great deal. The other problem we're seeing is that when the custom decorator we've written is included in the workflow, escapedCustomDfs complains that the attributes we've added aren't there, as if the body has either changed or is missing. This doesn't happen with the text source. We'll do some more debugging when we get time, but deadlines are looming and we need to make progress.
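To illustrate the decorator issue, here is a sketch of the kind of flow where it bites; the decorator name and attribute are invented for the example, not our actual code:

```
# the custom decorator is meant to attach an attribute (e.g. "hostname")
# that escapedCustomDfs then expands via %{hostname} in the output path
collector1 : collectorSource(35853) |
             { customAttrDeco => escapedCustomDfs("hdfs://namenode/flume/%{hostname}", "events-") } ;
```

In a flow like this, escapedCustomDfs reports the attribute as absent even though the decorator has already set it on the event.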
For the time being we're probably going to change approach and only ingest nightly using 'text' when the log file rotates. That rather defeats our goal of streaming data as it arrives, but it does let us guarantee the integrity of the data. We have also experimented with vanilla tail, and the problem is present in both versions, so I assume tailDir invokes vanilla tail when it finds a new file.
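The nightly fallback would look something like the following, run after rotation against the rotated file (the path and node names are illustrative only); since the text source reads a closed file once to completion, we can at least verify the counts match afterwards:

```
agent1 : text("/var/log/app/app.log.1") | agentSink("collector1", 35853) ;
```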
Thanks for your help,