The current demux-archive plumbing is quite complicated. At Berkeley, we need something much simpler.
Duplicate suppression in archiver.
Simple sink archiver.
Copies all the .done files out of the sink, runs an archiver MapReduce job, then merges the output of that job into the archive, renaming files to avoid collisions.
Intended use is to run once every day or two, to empty out the sink.
A future enhancement, once we have appends, is to actually merge files during promotion rather than just renaming to avoid collisions.
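The rename-to-avoid-collision step during promotion could look something like the sketch below. This is a hypothetical helper, not the patch's actual code: the method name `collisionFreeName` and the numeric-suffix scheme are assumptions for illustration; the real patch would do this against HDFS via the Hadoop FileSystem API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the promotion step's collision avoidance: when
// merging archiver output into the archive, a file whose name already
// exists gets a numeric suffix instead of overwriting the existing file.
public class ArchivePromotion {

    // Returns `name` unchanged if unused in the target directory,
    // otherwise inserts .1, .2, ... before the extension until free.
    public static String collisionFreeName(String name, Set<String> existing) {
        if (!existing.contains(name)) {
            return name;
        }
        int dot = name.lastIndexOf('.');
        String base = dot >= 0 ? name.substring(0, dot) : name;
        String ext  = dot >= 0 ? name.substring(dot) : "";
        for (int i = 1; ; i++) {
            String candidate = base + "." + i + ext;
            if (!existing.contains(candidate)) {
                return candidate;
            }
        }
    }

    public static void main(String[] args) {
        Set<String> archive = new HashSet<>();
        archive.add("part-00000.seq");
        // Collides, so a suffix is inserted before the extension.
        System.out.println(collisionFreeName("part-00000.seq", archive));
        // No collision, so the name passes through untouched.
        System.out.println(collisionFreeName("part-00001.seq", archive));
    }
}
```

With appends, the loop above would disappear entirely: colliding output would be appended onto the existing archive file instead of renamed alongside it.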
If there's no Demux, is the purpose of Chukwa just to collect logs and store them in a single jumbled mix of all the log record types?
No. The archiver in this patch will, by default, group output by cluster, day, and datatype, which is well suited to our use case: MapReduce analytics of logs.
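That default grouping amounts to deriving a directory path from each record. A minimal sketch of the idea, assuming a `groupPath` helper and a `cluster/yyyyMMdd/datatype` layout that are illustrative only (the patch's actual path scheme may differ):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical sketch of grouping archiver output by cluster, day, and
// datatype: each record's grouping key becomes a directory path in the
// archive, so a MapReduce analytics job can take exactly one
// cluster/day/datatype slice as its input.
public class ArchiveLayout {

    // e.g. cluster=demo-cluster, timestamp=0, datatype=SysLog
    //   -> "demo-cluster/19700101/SysLog"
    public static String groupPath(String cluster, long timestamp, String datatype) {
        SimpleDateFormat day = new SimpleDateFormat("yyyyMMdd");
        day.setTimeZone(TimeZone.getTimeZone("UTC"));  // stable day boundaries
        return cluster + "/" + day.format(new Date(timestamp)) + "/" + datatype;
    }

    public static void main(String[] args) {
        System.out.println(groupPath("demo-cluster", 0L, "SysLog"));
    }
}
```

A job interested in one day of one datatype then reads a single directory rather than scanning the whole jumbled sink.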
Revised; fixes a few unit test problems.
Taking silence for consent, I just committed this.