  Atlas / ATLAS-2891

Incorrect column lineage: each output column has input from *all columns* of the input table

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.8.2
    • Fix Version/s: 0.8.3, 1.2.0, 2.0.0
    • Component/s: atlas-intg
    • Labels: None

    Description

      Column lineage generated by the Atlas Hive hook is incorrect for certain queries, such as the following INSERT:

      CREATE TABLE source_tbl(col_001 INT, col_002 INT, col_003 INT);
      
      CREATE TABLE target_tbl(col_001 INT, col_002 INT, col_003 INT);
      
      INSERT INTO target_tbl SELECT v1.col_001, v1.col_002, v1.col_003 FROM (SELECT col_001, col_002, col_003, ROW_NUMBER() OVER() AS r_num FROM source_tbl) v1;
      
      

      In this case, the lineage for each column in target_tbl shows input from all columns in source_tbl. The lineage information provided to post hooks (like the Atlas hook) contains 3 entries, one for each column in target_tbl. Note that the dependency for each column lists all columns of source_tbl, including its virtual columns (BLOCK__OFFSET__INSIDE__FILE, INPUT__FILE__NAME, ROW__ID):

      DependencyKey=default.target_tbl:FieldSchema(name:col_001, type:int, comment:null)
      Dependency=[SCRIPT]
                 [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
                  default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
                  default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
                 ];
       
      DependencyKey=default.target_tbl:FieldSchema(name:col_002, type:int, comment:null)
      Dependency=[SCRIPT]
                 [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
                  default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
                  default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
                 ];
       
      DependencyKey=default.target_tbl:FieldSchema(name:col_003, type:int, comment:null)
      Dependency=[SCRIPT]
                 [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
                  default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
                  default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
                 ];
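
      The symptom in the entries above can be sketched as a small check. This is only an illustrative model in Python, not the Hive LineageInfo API and not the fix in the attached patch: the names reported_lineage, expected_lineage and over_broad_targets are hypothetical, and the "SIMPLE" dependency type used for the expected case is an assumption.

```python
# Illustrative model of the lineage entries shown above (hypothetical names,
# not Hive or Atlas classes). Each target column maps to a pair:
# (dependency type, list of source columns it depends on).
SOURCE_COLUMNS = ["col_001", "col_002", "col_003"]
VIRTUAL_COLUMNS = ["BLOCK__OFFSET__INSIDE__FILE", "INPUT__FILE__NAME", "ROW__ID"]

# What the post hook receives for the INSERT with ROW_NUMBER() OVER():
# every target column lists every source column plus the virtual columns.
reported_lineage = {
    target: ("SCRIPT", SOURCE_COLUMNS + VIRTUAL_COLUMNS)
    for target in SOURCE_COLUMNS
}

# What one would expect from the SELECT list: each target column depends
# only on the same-named source column. "SIMPLE" is an assumed type here.
expected_lineage = {col: ("SIMPLE", [col]) for col in SOURCE_COLUMNS}


def over_broad_targets(lineage, source_columns):
    """Return the target columns whose dependency lists every source column."""
    all_sources = set(source_columns)
    return sorted(
        target
        for target, (_dep_type, inputs) in lineage.items()
        if len(all_sources) > 1 and all_sources <= set(inputs)
    )


print(over_broad_targets(reported_lineage, SOURCE_COLUMNS))
print(over_broad_targets(expected_lineage, SOURCE_COLUMNS))
```

      With the reported entries, all three target columns are flagged; with the one-to-one mapping, none are.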
      

      When the INSERT statement does not include "ROW_NUMBER() OVER() AS r_num" in the subquery, the lineage details look correct.

      This issue is seen with Hive 1, but not with Hive 2 or Hive 3.

      Attachments

        1. ATLAS-2891.png
          32 kB
          Madhan Neethiraj
        2. ATLAS-2891-branch-0.8.patch
          12 kB
          Madhan Neethiraj

            People

              Assignee: Madhan Neethiraj
              Reporter: Madhan Neethiraj