Details
- Bug
- Status: Resolved
- Critical
- Resolution: Fixed
- 0.8.2
- None
Description
Column lineage generated by the Atlas Hive hook is incorrect for certain queries, such as the following INSERT:
CREATE TABLE source_tbl(col_001 INT, col_002 INT, col_003 INT);
CREATE TABLE target_tbl(col_001 INT, col_002 INT, col_003 INT);
INSERT INTO target_tbl
SELECT v1.col_001, v1.col_002, v1.col_003
FROM (SELECT col_001, col_002, col_003, ROW_NUMBER() OVER() AS r_num FROM source_tbl) v1;
In this case, the lineage for each column in target_tbl shows input from all columns in source_tbl. The lineage information provided to post hooks (such as the Atlas hook) contains three entries, one for each column in target_tbl. Note that the dependency for each column lists all columns of source_tbl:
DependencyKey=default.target_tbl:FieldSchema(name:col_001, type:int, comment:null)
Dependency=[SCRIPT] [
  default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
  default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
  default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
  default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
  default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
  default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
];

DependencyKey=default.target_tbl:FieldSchema(name:col_002, type:int, comment:null)
Dependency=[SCRIPT] [
  default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
  default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
  default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
  default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
  default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
  default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
];

DependencyKey=default.target_tbl:FieldSchema(name:col_003, type:int, comment:null)
Dependency=[SCRIPT] [
  default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
  default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
  default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
  default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
  default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
  default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
];
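To make the problem easier to spot in a long dump, the entries above can be summarized mechanically. The following is a minimal Python sketch, not part of Atlas or Hive: the helper names (parse_lineage, targets_with_multiple_sources, make_entry) are hypothetical, the regular expressions assume the exact text layout shown above, and the sample dump is shortened to the three real columns (virtual columns omitted). For this particular query, each target column is a direct projection of one source column, so any target reported with more than one source is suspect.

```python
import re

# Patterns assume the text layout of the dump shown above.
KEY_RE = re.compile(r"([\w.]+):FieldSchema\(name:(\w+),")
DEP_RE = re.compile(r"([\w.]+)\((\w+)\):FieldSchema\(name:(\w+),")


def parse_lineage(dump):
    """Map each target 'table.column' to the set of source 'table.column' names."""
    lineage = {}
    # Each entry starts with 'DependencyKey='; the text before the first one is empty.
    for entry in dump.split("DependencyKey=")[1:]:
        key_part, _, dep_part = entry.partition("Dependency=")
        table, column = KEY_RE.search(key_part).groups()
        lineage[f"{table}.{column}"] = {
            f"{src_table}.{src_col}"
            for src_table, _alias, src_col in DEP_RE.findall(dep_part)
        }
    return lineage


def targets_with_multiple_sources(lineage):
    """Return target columns whose reported lineage names more than one source column."""
    return sorted(t for t, sources in lineage.items() if len(sources) > 1)


# Hypothetical helper that rebuilds a shortened entry in the same layout
# as the dump above (virtual columns left out for brevity).
SOURCE_COLS = ["col_001", "col_002", "col_003"]


def make_entry(target_col):
    deps = ", ".join(
        f"default.source_tbl(src):FieldSchema(name:{c}, type:int, comment:null)"
        for c in SOURCE_COLS
    )
    return (
        f"DependencyKey=default.target_tbl:FieldSchema(name:{target_col}, "
        f"type:int, comment:null) Dependency=[SCRIPT] [{deps} ]; "
    )


SAMPLE_DUMP = "".join(make_entry(c) for c in SOURCE_COLS)

lineage = parse_lineage(SAMPLE_DUMP)
for target, sources in sorted(lineage.items()):
    print(target, "<-", sorted(sources))
print("suspect targets:", targets_with_multiple_sources(lineage))
```

With the dump from this issue, all three target columns are flagged, since each one lists every column of source_tbl as an input.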
When the INSERT statement does not include "ROW_NUMBER() OVER() AS r_num", the lineage details look correct.
This issue is seen in Hive 1, but not in Hive 2 or Hive 3.
Attachments
Issue Links
- is caused by: HIVE-20633 Incorrect column lineage: each output column has input from *all columns* of the input table (Open)