ReportLineageToAtlas with a certain setup can throw the following exception, failing to send reports to Atlas (while retrying indefinitely):
The exception is coming from
and the problem is that an Atlas Processor entity has both a nifi_queue and a nifi_data DataSet input entity with the same qualifiedName.
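The failure mode can be reproduced in isolation. Below is a minimal sketch, assuming the input entities are indexed by qualifiedName with `Collectors.toMap` (a hypothetical stand-in for the actual code path; the `EntityRef` record and the UUID are illustrative only):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DuplicateKeyDemo {
    // Hypothetical stand-in for an Atlas entity reference; only the fields
    // relevant to the collision are modeled.
    record EntityRef(String typeName, String qualifiedName) {}

    // Indexes input entities by qualifiedName. Collectors.toMap with no merge
    // function throws IllegalStateException ("Duplicate key ...") when two
    // entities share a qualifiedName.
    static Map<String, EntityRef> indexByQualifiedName(List<EntityRef> inputs) {
        return inputs.stream()
                .collect(Collectors.toMap(EntityRef::qualifiedName, e -> e));
    }

    public static void main(String[] args) {
        // Two distinct input entities that resolve to the same qualifiedName
        // (placeholder UUID), as in the scenario described in this report.
        List<EntityRef> inputs = List.of(
                new EntityRef("nifi_queue", "00000000-0000-1000-0000-000000000000@cluster1"),
                new EntityRef("nifi_data",  "00000000-0000-1000-0000-000000000000@cluster1"));
        try {
            indexByQualifiedName(inputs);
        } catch (IllegalStateException e) {
            // Reaches this branch because of the duplicate qualifiedName key
            System.out.println(e.getClass().getSimpleName());
        }
    }
}
```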
It can happen when the NiFi processor (P_Subject) that corresponds to the Atlas Processor entity
- has an inbound connection that is represented in Atlas by a nifi_queue entity. There are multiple ways to enforce this; one is to make sure the origin processor of the inbound queue (P_Origin, where P_Origin -> P_Subject) also has a connection to another processor, like P_Origin -> P_Other, so the flow looks like this:
- also generates an input provenance event (CREATE, RECEIVE or FETCH) on its own, and has no specialized input entity (like fs_path or hive_table) but uses the generic nifi_data Atlas type to represent its input (such a processor is called "unknown" in the documentation of the reporting task)
See attached atlas_duplicate_key.xml for an example flow template.
Here InvokeHTTP has an input nifi_queue entity in Atlas (see the explanation above; for more details see the Path Separation Logic section in the reporting task docs). Its qualifiedName is processorUUID@clustername, derived from the UUID of the queue's destination processor, i.e. InvokeHTTP's UUID in this case.
It also sends the incoming flowfile in an HTTP request and creates another flowfile from the HTTP response, which produces a FETCH provenance event and, in turn, a nifi_data entity in Atlas. Its qualifiedName is also processorUUID@clustername, using the UUID of the processor that generated the event, which is again InvokeHTTP.
These two entities having the same qualifiedName causes the duplicate key error.
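To make the collision concrete, here is a sketch of how both qualifiedNames resolve to the same string. The `toQualifiedName` helper and the UUID below are illustrative assumptions, not the reporting task's actual code; only the `componentId@clusterName` pattern comes from the description above:

```java
public class QualifiedNameCollision {
    // Illustrative helper (an assumption, not the actual implementation):
    // qualifiedName = componentId@clusterName
    static String toQualifiedName(String componentId, String clusterName) {
        return componentId + "@" + clusterName;
    }

    public static void main(String[] args) {
        String invokeHttpUuid = "00000000-0000-1000-0000-000000000000"; // placeholder UUID

        // nifi_queue entity of the inbound connection: keyed by the UUID of the
        // connection's destination processor, i.e. InvokeHTTP
        String queueQualifiedName = toQualifiedName(invokeHttpUuid, "cluster1");

        // nifi_data entity created for the FETCH event: keyed by the UUID of the
        // processor that generated the event, which is also InvokeHTTP
        String dataQualifiedName = toQualifiedName(invokeHttpUuid, "cluster1");

        System.out.println(queueQualifiedName.equals(dataQualifiedName)); // prints "true"
    }
}
```

Since both entities are keyed off the same processor UUID and cluster name, the two qualifiedNames are identical, which triggers the duplicate key error.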