Brian ONeill, I haven't thought too hard about distinct yet myself. Since I'm really only thinking about Trident and not Storm in general, doing a distinct strictly within a batch is one straightforward option. Unfortunately, from a user standpoint, I think this would be (a) minimally useful and (b) confusing. Instead, could we implement something like an approximate distinct using an LRU cache? Maybe even go so far as to implement an SQF, as described in http://www.vldb.org/pvldb/vol6/p589-dutta.pdf (which I haven't read in its entirety yet).
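To make the LRU idea concrete, here is a minimal sketch in plain Java, not wired into Trident. The names `ApproxDistinct` and `firstSeen` are hypothetical, chosen only for illustration, and the capacity is an assumed tuning knob. It is approximate in one direction only: a key that has been evicted from the cache will be emitted again if it reappears.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Approximate DISTINCT over a stream: remembers only the most recently seen keys. */
class ApproxDistinct<T> {
    private final Map<T, Boolean> recent;

    ApproxDistinct(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used,
        // so the "eldest" entry is always the least recently used one.
        this.recent = new LinkedHashMap<T, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<T, Boolean> eldest) {
                // Evict the least recently used key once the cache is over capacity.
                return size() > capacity;
            }
        };
    }

    /**
     * Returns true if the item is not currently remembered and should be emitted.
     * A lookup hit refreshes the item's position in the LRU order.
     */
    boolean firstSeen(T item) {
        if (recent.get(item) != null) {
            return false; // seen recently: suppress as a duplicate
        }
        recent.put(item, Boolean.TRUE);
        return true;
    }
}
```

Unlike a per-batch distinct, the cache would carry over from one batch to the next, so duplicates are suppressed for as long as the key stays within the last `capacity` distinct keys touched.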
Also, what about order by? In what sense is an unbounded stream ordered?
I absolutely do not want to tie the Storm/Trident execution engine to an external data store such as Cassandra. Pig is supposed to be backend agnostic. Maybe the default tap and sink can be Kafka (tap) and Cassandra (sink). Finally, it should be possible to run a Pig script in Storm local mode.
And Pradeep Gollakota, I'm actually well on the way to having nested foreach working. The way I'm doing it now, each LogicalExpressionPlan becomes its own Trident BaseFunction. It works quite nicely so far. I haven't gotten to aggregates yet. What I probably won't implement for the POC is the tap and sink.