Hive / HIVE-20633

Incorrect column lineage: each output column has input from *all columns* of the input table

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.2.2
    • Fix Version/s: None
    • Component/s: HiveServer2
    • Labels: None

    Description

      The column lineage details made available to post hooks are incorrect for certain queries, such as the following INSERT:

      CREATE TABLE source_tbl(col_001 INT, col_002 INT, col_003 INT);
      
      CREATE TABLE target_tbl(col_001 INT, col_002 INT, col_003 INT);
      
      INSERT INTO target_tbl
      SELECT v1.col_001, v1.col_002, v1.col_003
      FROM (SELECT col_001, col_002, col_003, ROW_NUMBER() OVER() AS r_num
            FROM source_tbl) v1;
      
      

      Below are the lineage details passed to post hooks (such as the Atlas hook) via HookContext.getLinfo(). There are 3 entries, one for each target table column. Note that the dependency for every target column lists all columns of the source table, including its virtual columns (BLOCK__OFFSET__INSIDE__FILE, INPUT__FILE__NAME, ROW__ID).

      DependencyKey=default.target_tbl:FieldSchema(name:col_001, type:int, comment:null)
      Dependency=[SCRIPT]
                 [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
                  default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
                  default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
                 ];
       
      DependencyKey=default.target_tbl:FieldSchema(name:col_002, type:int, comment:null)
      Dependency=[SCRIPT]
                 [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
                  default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
                  default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
                 ];
       
      DependencyKey=default.target_tbl:FieldSchema(name:col_003, type:int, comment:null)
      Dependency=[SCRIPT]
                 [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
                  default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
                  default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
                  default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
                 ];
      

      When the INSERT statement does not include "ROW_NUMBER() OVER() AS r_num", the lineage details look correct.
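      For comparison, the control case presumably looks like the statement below. This is only a sketch: the report does not include the exact query that was run, so the form shown here (the same INSERT with the window function removed from the subquery) is an assumption.

```sql
-- Assumed control case: same tables and projection as the failing INSERT,
-- but without ROW_NUMBER() OVER() AS r_num in the subquery.
INSERT INTO target_tbl
SELECT v1.col_001, v1.col_002, v1.col_003
FROM (SELECT col_001, col_002, col_003
      FROM source_tbl) v1;
```

      Running both variants against the same tables and comparing the entries returned by HookContext.getLinfo() should help isolate the window function as the trigger.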

            People

              Assignee: Unassigned
              Reporter: Madhan Neethiraj (madhan)
              Votes: 1
              Watchers: 4
