>> Scenario 1) Suppose I want to have a flow where we have source data at an agent, that then goes to some intermediate node for processing and then goes to a collector. With the current model, I would tag all the nodes to be in the same flow. How does an agent know it needs to send to the intermediate processing node instead of the collector?
>Isn't that what the sink is for? The sink encodes information about the next hop. The flow just restricts the set of nodes (for load balancing) that the next hop can be drawn from.
I'm a little confused by the first question. When you speak of sinks, which sink are you talking about – the rpc sink on the agent, the processor or the hdfs/hbase/flumebase on a collector?
The current properties and constraints on flows 1.0 are that 1) "flow-is-a-set-of-nodes" and 2) "nodes-are-only-in-one-flow". The problem really shows up when there is a processor in the middle: it is both a source (for the collector) and a sink (for the agent). If flows 1.0 are really focused on edge sets between two sets of nodes, constraint #2 gets in the way because the intermediate node would need to be part of two flows. If flows 1.0 are focused on end-to-end situations, then constraint #1 complicates the logic needed to express automatically generated topologies.
One interesting suggestion is to have something like logicalSources and logicalSink(logicalNodeId) variants that take a tier id instead of a logicalNodeId. This would give a name for the set of edges between tiers of nodes.
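To make the tier-id suggestion concrete, here's a rough sketch (plain Python, not Flume syntax; the names logical_sink, tiers, and the node/tier ids are all made up for illustration). The idea is that naming a tier on a sink resolves to the set of candidate next-hop nodes in that tier, i.e. it names the edge set between two tiers:

```python
# Hypothetical registry mapping tier ids to the logical nodes in each tier.
tiers = {
    "agent":     ["agent1", "agent2"],
    "processor": ["proc1"],
    "collector": ["coll1", "coll2"],
}

def logical_sink(tier_id):
    """A sink configured with a tier id resolves to any node in that
    tier, so it names the set of edges into the tier rather than a
    single logical node."""
    return tiers[tier_id]

# An agent configured with logical_sink("processor") sends to the
# processor tier; the processor's own sink then names the collector tier.
print(logical_sink("processor"))   # ['proc1']
print(logical_sink("collector"))   # ['coll1', 'coll2']
```

Under this sketch the intermediate processor never has to be "in two flows"; it just has an inbound edge set named by the agents and an outbound edge set it names itself.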
> This is a problem if the query 'tier' belongs to more than one flow; otherwise it fits fine with the abstraction. Tiers are a nice abstraction as well; when combined with flows they give horizontal and vertical abstractions. The relationship between tiers and flows needs to be explicit: it sounds like each flow is comprised of tiers and no tier can belong to more than one flow. That sounds like a good plan to me, as it seems like you still need flows to handle end-to-end constructs like reliability and compression.
Let's make sure we are talking about the same thing when we say flow. I'll define a flow to be a single, potentially multi-hop path from a root (sources) to a final sink. With this definition, there would be one flow in scenario 1 and two flows in the scenario 2 example: an [agent -> collector] flow and an [agent -> query] flow.
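In other words, under this definition a flow is just a path of tiers, and its tier-to-tier edges fall out of the path. A minimal sketch (tier names are illustrative):

```python
# Scenario 1: one end-to-end flow through an intermediate processor tier.
scenario1_flows = [
    ["agent", "processor", "collector"],
]

# Scenario 2: two flows that happen to share the agent tier as their root.
scenario2_flows = [
    ["agent", "collector"],   # the [agent -> collector] flow
    ["agent", "query"],       # the [agent -> query] flow
]

def edges(flow):
    """The tier-to-tier edges that make up an end-to-end flow."""
    return list(zip(flow, flow[1:]))

print(edges(scenario1_flows[0]))
# [('agent', 'processor'), ('processor', 'collector')]
```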
Can you give an example of the problem you are talking about? Is the concern you bring up the DAG situation: [sources1 -> query], [sources2 -> query] (DAG vs. tree)? Or is it that someone now has to tell the agent to send to both the query and collector destinations?
My first thought is that a source tier sending to multiple destination tiers is reasonable. We may need to enforce a restriction where only one end-to-end flow can be reliable, but that is doable with knowledge of the tier-to-tier flows.
I think that the DAG situation, two different source tiers going to the same destination tier, is reasonable as well. I've seen production scenarios where folks are collecting two different sources of data and sending them to the same collector that uses the output bucketing feature to demux data when it writes to hdfs.
I generally see cycles as a problem and out of scope for this.
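Since cycles are out of scope, a master could simply reject any topology whose tier-to-tier edges don't form a DAG. A rough sketch of such a check, assuming the topology is given as a dict of tier -> downstream tiers (every tier present as a key):

```python
def is_dag(graph):
    """DFS three-color cycle check over a tier graph."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in graph}

    def visit(t):
        color[t] = GRAY                 # on the current DFS path
        for nxt in graph[t]:
            if color[nxt] == GRAY:
                return False            # back edge -> cycle
            if color[nxt] == WHITE and not visit(nxt):
                return False
        color[t] = BLACK                # fully explored
        return True

    return all(visit(t) for t in graph if color[t] == WHITE)

# Two source tiers fanning into one query tier is a DAG (fine);
# a collector feeding back into an agent introduces a cycle (rejected).
fan_in = {"sources1": ["query"], "sources2": ["query"], "query": []}
looped = {"agent": ["collector"], "collector": ["agent"]}
print(is_dag(fan_in), is_dag(looped))   # True False
```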
I think there are different contexts in which to apply properties, on a whole flow or on part of a flow. Ideally this is something that could be encoded and stored at the master. Some properties, like reliability, are likely end-to-end properties of a flow. Compression, batching and encryption are likely properties of an edge between tiers. For example, let's say we have a scenario where we send data to a "local" relay using cheap batching and lzo compression, but then want to use "expensive" gzip compression and encryption for a wan connection to a remote collector. We want end-to-end reliability overall, but different compression algorithms on different tier connections.
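One way the master could store this is as flow-level properties plus per-edge overrides. A sketch of the relay/wan example above (the property names and values here are made up, not real Flume settings):

```python
# End-to-end properties attach to the flow as a whole.
flow_properties = {
    "reliability": "end-to-end-ack",
}

# Edge properties attach to a single tier-to-tier connection.
edge_properties = {
    ("agent", "relay"):     {"batching": "cheap", "compression": "lzo"},
    ("relay", "collector"): {"compression": "gzip", "encryption": "on"},  # wan hop
}

def settings_for(edge):
    """Effective settings for one edge: flow-wide properties plus any
    edge-specific overrides."""
    return {**flow_properties, **edge_properties.get(edge, {})}

print(settings_for(("relay", "collector")))
# {'reliability': 'end-to-end-ack', 'compression': 'gzip', 'encryption': 'on'}
```

This keeps reliability a single flow-wide fact while letting each tier connection pick its own compression and encryption.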