Atlas / ATLAS-3655

Create 'spark_application' type to prevent 'spark_process' from being updated by multiple operations

Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0, 3.0.0
    • Component/s: None
    • Labels: None

    Description

      Create a 'spark_application' type so that a single 'spark_process' entity is not updated by multiple operations. Currently, the Spark Atlas Connector uses 'spark_process' as the top-level type for a Spark session, so the same entity is updated by every operation within that session.

      The following statements:

      spark.sql("create table table_1(col1 int,col2 string)");
      spark.sql("create table table_2 as select * from table_1");
      

      result in the following correct lineage:

      table_1 ------> spark_process1 -------> table_2

      However, executing similar statements in the same Spark session:

      spark.sql("create table table_3(col1 int,col2 string)"); 
      spark.sql("create table table_4 as select * from table_3");
      

      results in the same 'spark_process' entity being updated, so the lineage now connects all four tables (see the screenshot in the attachments).
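The merging behaviour described above can be illustrated with a small, dependency-free sketch. It is a toy model, not the Spark Atlas Connector's code: `ProcessEntity`, `record_ctas`, and the `"session-1"` key are hypothetical names, and the only assumption carried over from the description is that every operation in a session resolves to the same process entity.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessEntity:
    """Toy stand-in for a process entity (not the real Atlas model)."""
    qualified_name: str
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)


def record_ctas(store, session_id, source, target):
    """Record a CREATE TABLE AS SELECT under a process keyed only by session."""
    # Assumption: every operation in the session maps to the same key,
    # so the existing entity is updated instead of a new one being created.
    process = store.setdefault(session_id, ProcessEntity(session_id))
    process.inputs.add(source)
    process.outputs.add(target)
    return process


store = {}
record_ctas(store, "session-1", "table_1", "table_2")
record_ctas(store, "session-1", "table_3", "table_4")

merged = store["session-1"]
print(sorted(merged.inputs))   # ['table_1', 'table_3']
print(sorted(merged.outputs))  # ['table_2', 'table_4']
```

Because both operations land on one entity, a lineage graph drawn from it would link table_1 and table_3 to both table_2 and table_4, which matches the unwanted result reported here.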

       

      The proposal is to create a 'spark_application' entity and associate all 'spark_process' entities (created within that session) with it.
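A minimal sketch of the proposed shape, under stated assumptions: one application object per session owns a separate process object per operation. The class names, the `record_operation` helper, and the `op-N` naming scheme are hypothetical and only illustrate the idea; they are not the actual Atlas type definitions or the connector's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SparkProcess:
    """Toy model of one 'spark_process' entity: a single operation."""
    qualified_name: str
    inputs: frozenset
    outputs: frozenset


@dataclass
class SparkApplication:
    """Toy model of the proposed 'spark_application' entity: one per session."""
    qualified_name: str
    processes: list = field(default_factory=list)

    def record_operation(self, inputs, outputs):
        # Hypothetical naming scheme: application name plus an operation index.
        name = f"{self.qualified_name}/op-{len(self.processes) + 1}"
        process = SparkProcess(name, frozenset(inputs), frozenset(outputs))
        self.processes.append(process)
        return process


app = SparkApplication("session-1")
first = app.record_operation({"table_1"}, {"table_2"})
second = app.record_operation({"table_3"}, {"table_4"})

for p in app.processes:
    print(p.qualified_name, sorted(p.inputs), "->", sorted(p.outputs))
# session-1/op-1 ['table_1'] -> ['table_2']
# session-1/op-2 ['table_3'] -> ['table_4']
```

In this model each operation keeps its own inputs and outputs, so the two lineages stay separate while both processes remain reachable from the session-level application.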

      Attachments

        1. Screenshot from 2020-03-03 16-09-39.png
          66 kB
          Vladislav Glinskiy

            People

              Assignee: Unassigned
              Reporter: Vladislav Glinskiy (vladglinskiy)
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h 10m