Affects Version/s: 0.6-incubating
Fix Version/s: 0.7-incubating
Apache Storm Integration with Apache Atlas (incubating)
Apache Storm is a distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. A Storm computation is essentially a DAG of nodes, which is called a topology.
Apache Atlas is a metadata repository that enables end-to-end data lineage, search, and the association of business classifications.
The goal of this integration is, at a minimum, to push the operational topology metadata along with the underlying data source(s), target(s), derivation processes, and any available business context, so that Atlas can capture the lineage for the topology.
It would also help to support custom user annotations per node in the topology.
There are 2 parts in this process detailed below:
Data model to represent the concepts in Storm
Storm Bridge to update metadata in Atlas
A data model is represented as a Type in Atlas. It describes the various nodes in the DAG, such as spouts and bolts, and the corresponding source and target types. These need to be expressed as Types in the Atlas type system. At a minimum, we need to create types for:
Storm topology containing spouts, bolts, etc. with associations between them
Source (typically Kafka, etc.)
Target (typically Hive, HBase, HDFS, etc.)
You can take a look at the data model code for Hive. Storm should be simpler than Hive from a data modeling perspective.
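The three kinds of types listed above could be sketched roughly as follows. This is only an illustration of the shape of the model in plain Python, not Atlas type-system code; the class names (`DataSet`, `Node`, `Topology`), the field names, and the `kind` labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class DataSet:
    """An external source or target, e.g. a Kafka topic or a Hive table."""
    kind: str  # illustrative label such as "kafka_topic" or "hive_table"
    name: str


@dataclass
class Node:
    """A spout or bolt in the topology DAG."""
    name: str
    node_type: str  # "spout" or "bolt"
    reads: List[DataSet] = field(default_factory=list)
    writes: List[DataSet] = field(default_factory=list)


@dataclass
class Topology:
    """A named DAG of nodes plus the associations (edges) between them."""
    name: str
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (upstream, downstream)

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node

    def connect(self, upstream: str, downstream: str) -> None:
        # Reject edges that mention nodes the topology does not contain.
        for name in (upstream, downstream):
            if name not in self.nodes:
                raise KeyError(f"unknown node: {name}")
        self.edges.append((upstream, downstream))

    def sources(self) -> Set[DataSet]:
        """All data sets read by any node."""
        return {d for n in self.nodes.values() for d in n.reads}

    def targets(self) -> Set[DataSet]:
        """All data sets written by any node."""
        return {d for n in self.nodes.values() for d in n.writes}


# Hypothetical topology: one Kafka spout feeding one Hive bolt.
topology = Topology("clickstream")
topology.add_node(Node("kafka_spout", "spout", reads=[DataSet("kafka_topic", "clicks")]))
topology.add_node(Node("hive_bolt", "bolt", writes=[DataSet("hive_table", "clicks_raw")]))
topology.connect("kafka_spout", "hive_bolt")
```

With this shape, the lineage for a topology is the set of `sources()` flowing through the topology into the set of `targets()`.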
Pushing Metadata into Atlas
There are 2 parts to the bridge:
The first part is a one-time import that lists all the active topologies in Storm and pushes their metadata into Atlas. This addresses cases where Storm deployments exist before Atlas.
You can refer to the bridge code for Hive.
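As a rough sketch of the one-time import, the logic could look like the following. It deliberately avoids real Storm or Atlas client calls: `list_topologies` and `push_entity` are hypothetical callables standing in for those clients, and the summary fields (`name`, `id`, `status`), the `"ACTIVE"` status value, and the `"storm_topology"` type label are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List


def import_active_topologies(
    list_topologies: Callable[[], Iterable[Dict[str, str]]],
    push_entity: Callable[[Dict[str, str]], None],
) -> List[str]:
    """Push every active topology once and return the names that were pushed."""
    pushed = []
    for summary in list_topologies():
        # Only active topologies are imported; anything else is skipped.
        if summary.get("status") != "ACTIVE":
            continue
        push_entity({
            "type": "storm_topology",  # hypothetical type label
            "name": summary["name"],
            "id": summary["id"],
        })
        pushed.append(summary["name"])
    return pushed


# In-memory stand-ins for the Storm listing and the Atlas repository.
fake_cluster = [
    {"name": "clickstream", "id": "clickstream-1", "status": "ACTIVE"},
    {"name": "old-job", "id": "old-job-7", "status": "KILLED"},
]
fake_repository: List[Dict[str, str]] = []

imported = import_active_topologies(lambda: fake_cluster, fake_repository.append)
```

Passing the two clients in as callables keeps the import logic testable without a running Storm cluster or Atlas server.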
The second part is a hook: Atlas needs to be notified when a new topology is registered successfully in Storm or when someone changes the definition of an existing topology.
You can refer to the hook code for Hive.
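One possible shape for the notification logic is sketched below. It is not the Storm or Atlas hook API: the class `TopologyChangeNotifier`, the `on_submit` method, the `send` callable, and the `CREATE`/`UPDATE` event names are all hypothetical. The sketch fingerprints each topology definition so that a resubmission with an unchanged definition does not produce a notification.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Optional


class TopologyChangeNotifier:
    """Sends a message when a topology is new or its definition has changed."""

    def __init__(self, send: Callable[[Dict[str, Any]], None]) -> None:
        self._send = send
        self._fingerprints: Dict[str, str] = {}  # topology name -> definition hash

    def on_submit(self, name: str, definition: Dict[str, Any]) -> Optional[str]:
        """Return "CREATE", "UPDATE", or None if nothing changed."""
        # Serialize with sorted keys so equal definitions hash identically.
        encoded = json.dumps(definition, sort_keys=True).encode("utf-8")
        fingerprint = hashlib.sha256(encoded).hexdigest()

        previous = self._fingerprints.get(name)
        if previous == fingerprint:
            return None  # same definition as last time, nothing to report

        self._fingerprints[name] = fingerprint
        event = "CREATE" if previous is None else "UPDATE"
        self._send({"event": event, "topology": name, "definition": definition})
        return event


# Messages are collected in a list here; a real hook would forward them to Atlas.
messages = []
notifier = TopologyChangeNotifier(messages.append)

first = notifier.on_submit("clickstream", {"spouts": ["kafka_spout"], "bolts": ["hive_bolt"]})
repeat = notifier.on_submit("clickstream", {"bolts": ["hive_bolt"], "spouts": ["kafka_spout"]})
changed = notifier.on_submit("clickstream", {"spouts": ["kafka_spout"], "bolts": ["hive_bolt", "hbase_bolt"]})
```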
Example use case:
Custom annotations associated with each node in the topology.
For example: data quality rules, error handling, etc. One such set of annotations could enumerate the rules for handling nulls, e.g. all nulls for a column get filtered.
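Per-node annotations of this kind could be recorded along the following lines. This is a minimal sketch under assumptions: the `NodeAnnotations` class, the category names, the rule names, and the node and column names are hypothetical and only illustrate how a null-handling rule might be attached to a node and looked up later.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional


class NodeAnnotations:
    """Free-form annotations keyed by topology node name."""

    def __init__(self) -> None:
        self._by_node: Dict[str, List[Dict[str, Any]]] = defaultdict(list)

    def annotate(self, node: str, category: str, rule: str, **details: Any) -> None:
        # Each annotation is a small record: a category, a rule, and optional details.
        self._by_node[node].append({"category": category, "rule": rule, **details})

    def for_node(self, node: str, category: Optional[str] = None) -> List[Dict[str, Any]]:
        """Return annotations for a node, optionally limited to one category."""
        items = self._by_node.get(node, [])
        return [a for a in items if category is None or a["category"] == category]


annotations = NodeAnnotations()
# Data quality rule: all nulls in the (hypothetical) user_id column get filtered.
annotations.annotate("parse_bolt", "data_quality", "filter_nulls", column="user_id")
annotations.annotate("parse_bolt", "error_handling", "route_to_error_stream")

quality_rules = annotations.for_node("parse_bolt", category="data_quality")
```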