Description
Create a 'spark_application' type to avoid the 'spark_process' entity being updated across multiple operations. Currently, the Spark Atlas Connector uses 'spark_process' as the top-level type for a Spark session, so the same entity is updated for every operation executed within that session.
The following statements:

spark.sql("create table table_1(col1 int, col2 string)")
spark.sql("create table table_2 as select * from table_1")

produce the expected lineage:

table_1 ------> spark_process1 -------> table_2
but executing similar statements in the same Spark session:

spark.sql("create table table_3(col1 int, col2 string)")
spark.sql("create table table_4 as select * from table_3")

updates the same 'spark_process' entity, so the lineage now incorrectly connects all four tables (see the screenshot in the attachments).
The proposal is to create a 'spark_application' entity per session and associate all 'spark_process' entities created within that session with it, so that each operation keeps its own lineage node.
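The proposed model could be sketched roughly as follows. This is a minimal illustration using Atlas's generic JSON entity shape; the attribute and relationship names below ('application', the qualifiedName scheme, the execution IDs) are assumptions for illustration, not the connector's actual schema.

```python
# Hypothetical sketch: one 'spark_application' entity per session, and one
# 'spark_process' entity per executed query, each linked to the application.
# Names and attributes are illustrative only.

def build_entities(app_id, execution_ids):
    """Return (application entity, list of per-query process entities)."""
    application = {
        "typeName": "spark_application",
        "attributes": {
            "qualifiedName": app_id,  # e.g. the Spark applicationId
            "name": app_id,
        },
    }
    processes = []
    for exec_id in execution_ids:
        processes.append({
            "typeName": "spark_process",
            "attributes": {
                # Unique per operation, so each query gets its own
                # lineage node instead of updating a shared one.
                "qualifiedName": f"{app_id}.{exec_id}",
                "name": exec_id,
            },
            # Link each process back to its parent application
            # (relationship attribute name is an assumption).
            "relationshipAttributes": {
                "application": {
                    "typeName": "spark_application",
                    "uniqueAttributes": {"qualifiedName": app_id},
                },
            },
        })
    return application, processes

app, procs = build_entities("application_1234", ["exec_1", "exec_2"])
print(procs[0]["attributes"]["qualifiedName"])  # application_1234.exec_1
```

With this shape, the two CTAS statements in the example above would map to two distinct 'spark_process' entities under one 'spark_application', keeping table_1/table_2 lineage separate from table_3/table_4.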
Attachments