[METRON-1460] Create a complementary non-split-join enrichment topology - ASF JIRA

Details

Type: New Feature
Status: Done
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: 0.5.0
Labels: None

Description

The split/join topology has some deficiencies:

It's hard to reason about
Understanding the latency of enriching a message requires looking at multiple bolts that each give summary statistics

The join bolt's cache is really hard to reason about when performance tuning
During spikes in traffic, you can overload the join bolt's cache and drop messages if you aren't careful
In general, it's hard to associate a cache size and a duration kept in cache with throughput and latency

There are a lot of network hops per message
Right now we are stuck at 2 stages of transformations (enrichment and threat intel). You may well want Stellar enrichments to depend on the output of other Stellar enrichments. To implement this in split/join, you'd have to create a cycle in the Storm topology

I propose that we move to a model where we do enrichments in a single bolt, in parallel, using a static threadpool (e.g. multiple workers in the same process would share the threadpool). In all other ways, this would be backwards compatible: a transparent drop-in for the existing enrichment topology.
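To make the single-bolt idea concrete, here is a minimal sketch in plain Java. It is not the code from the PR and it does not use the Storm or Metron APIs. It assumes a message is a plain map and an enrichment is a function from the message to the new fields it contributes. The class name ParallelEnricherSketch, the enrich method, and the pool size of 4 are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelEnricherSketch {
    // Static pool, so every caller in this JVM shares the same threads.
    // The size (4) is an arbitrary placeholder; daemon threads let the JVM exit.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r, "enrichment-worker");
        t.setDaemon(true);
        return t;
    });

    /** Runs every enrichment against the same message in parallel and merges the results. */
    public static Map<String, Object> enrich(
            Map<String, Object> message,
            List<Function<Map<String, Object>, Map<String, Object>>> enrichments) {
        // Enrichments run concurrently, so they only get a read-only snapshot.
        Map<String, Object> readOnly =
                Collections.unmodifiableMap(new LinkedHashMap<>(message));

        List<Future<Map<String, Object>>> pending = new ArrayList<>();
        for (Function<Map<String, Object>, Map<String, Object>> enrichment : enrichments) {
            Callable<Map<String, Object>> task = () -> enrichment.apply(readOnly);
            pending.add(POOL.submit(task));
        }

        // The "join" happens here, in process, with no cache and no network hop.
        Map<String, Object> enriched = new LinkedHashMap<>(message);
        try {
            for (Future<Map<String, Object>> result : pending) {
                enriched.putAll(result.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while enriching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Enrichment failed", e.getCause());
        }
        return enriched;
    }
}
```

In this sketch the only tuning knob is the pool size, and the latency of enriching one message is the time spent inside a single enrich call.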
There are some pros/cons about this too:

Pro
Easier to reason about from an individual message perspective
Architecturally decoupled from Storm
This sets us up if we want to consider other streaming technologies

Fewer bolts
spout -> enrichment bolt -> threatintel bolt -> output bolt

Way fewer network hops per message
The split/join topology currently needs 2n+1 hops, where n is the number of enrichments used (if using Stellar subgroups, each subgroup is a hop)

Easier to reason about from a performance perspective
We trade cache size and eviction timeout for threadpool size

We set ourselves up to have Stellar subgroups with dependencies
i.e. Stellar subgroups that depend on the output of other subgroups
If we do this, we can shrink the topology to just spout -> enrichment/threat intel -> output
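As a rough illustration of subgroups with dependencies, the sketch below applies subgroups one after another so that each one sees the fields produced by the ones before it. This is only one possible reading of the idea, in plain Java rather than Stellar or Storm code. The class DependentSubgroupsSketch, the applyInOrder method, and the use of declaration order to express dependencies are all assumptions made for the example.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class DependentSubgroupsSketch {
    /**
     * Applies each subgroup in the map's iteration order. A subgroup receives
     * the original message plus everything earlier subgroups have added.
     */
    public static Map<String, Object> applyInOrder(
            Map<String, Object> message,
            LinkedHashMap<String, Function<Map<String, Object>, Map<String, Object>>> subgroups) {
        Map<String, Object> current = new LinkedHashMap<>(message);
        for (Function<Map<String, Object>, Map<String, Object>> subgroup : subgroups.values()) {
            // Snapshot first, so a subgroup cannot mutate the accumulating message.
            Map<String, Object> view = Collections.unmodifiableMap(new LinkedHashMap<>(current));
            current.putAll(subgroup.apply(view));
        }
        return current;
    }
}
```

Because everything happens inside one method call, a later subgroup can read an earlier subgroup's output without any cycle in the topology.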

Con
We can no longer tune Stellar enrichments independently of HBase enrichments
To be fair, with enrichments moving to Stellar, this is the case in the split/join approach too

Performance is unknown so far

I propose to submit a PR that delivers an alternative, completely backwards-compatible enrichment topology. You would opt in by adjusting the start_enrichment_topology.sh script to use remote-unified.yaml instead of remote.yaml. If we live with it for a while and have good experiences with it, we can consider retiring the old enrichment topology.

Issue Links

links to

GitHub Pull Request #940

People

Assignee: Casey Stella

Reporter: Casey Stella

Votes: 0

Watchers: 2

Dates

Created: 22/Feb/18 18:22

Updated: 22/May/18 19:34

Resolved: 22/May/18 19:21