Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
None
Description
Suppose a query has a limit, as below, and the value column has to be masked:
SELECT key, value from default.src limit 10
The plan then looks like this:
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project ['key, 'value]
      +- 'UnresolvedRelation `default`.`src`

== Optimized Logical Plan ==
Project [key#36, HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMaskShowLastN(value#37,4,x,x,x,-1,1) AS value#41]
+- GlobalLimit 10
   +- LocalLimit 10
      +- SubmarineDataMasking
         +- HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#36, value#37]

== Physical Plan ==
Project [key#36, HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMaskShowLastN(value#37,4,x,x,x,-1,1) AS value#41]
+- *(2) GlobalLimit 10
   +- Exchange SinglePartition
      +- *(1) LocalLimit 10
         +- *(1) HiveTableScan [key#36, value#37], HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.OpenCSVSerde, [key#36, value#37]
The plan above reads all the files in the table, because the optimized logical plan places the masking Project above the limit. If the optimized logical plan instead places the limit above the masking projection, the physical plan is converted to use CollectLimit, and the collect then reads only one file:
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project ['key, 'value]
      +- 'UnresolvedRelation `default`.`src`

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Project [key#36, HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMaskShowLastN(value#37,4,x,x,x,-1,1) AS value#41]
      +- SubmarineDataMasking
         +- HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#36, value#37]

== Physical Plan ==
CollectLimit 10
+- Project [key#36, HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMaskShowLastN(value#37,4,x,x,x,-1,1) AS value#41]
   +- *(1) HiveTableScan [key#36, value#37], HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.OpenCSVSerde, [key#36, value#37]
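The difference between the two plans can be illustrated with a small toy model in plain Python. This is only a sketch and does not use Spark or reproduce the actual fix: the `Node` class and the functions `push_project_below_limit`, `plan_physical` and `render` are hypothetical names invented for this illustration, and the planner is reduced to one assumed behaviour, namely that only a limit at the root of the logical plan becomes CollectLimit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One operator in a single-child plan tree (toy model, not Spark's classes)."""
    name: str                      # operator name, e.g. "Project" or "GlobalLimit"
    arg: str = ""                  # operator argument, kept as plain text
    child: Optional["Node"] = None


def push_project_below_limit(plan: Node) -> Node:
    """Rewrite Project(GlobalLimit(LocalLimit(x))) as GlobalLimit(LocalLimit(Project(x))).

    Hypothetical rule: any other plan shape is returned unchanged.
    """
    glob = plan.child
    if plan.name == "Project" and glob is not None and glob.name == "GlobalLimit":
        local = glob.child
        if local is not None and local.name == "LocalLimit":
            project = Node("Project", plan.arg, local.child)
            return Node("GlobalLimit", glob.arg, Node("LocalLimit", local.arg, project))
    return plan


def plan_physical(plan: Node) -> Node:
    """Toy planner: a GlobalLimit/LocalLimit pair at the root becomes CollectLimit."""
    if plan.name == "GlobalLimit" and plan.child is not None and plan.child.name == "LocalLimit":
        return Node("CollectLimit", plan.arg, plan.child.child)
    return plan  # the limit is not at the root, so nothing is collapsed


def render(plan: Optional[Node]) -> str:
    """Render the operator chain on one line, root first."""
    parts = []
    while plan is not None:
        parts.append(f"{plan.name} {plan.arg}".strip())
        plan = plan.child
    return " +- ".join(parts)


scan = Node("HiveTableScan", "[key, value]")
# Shape of the first optimized plan: masking Project above the limit.
before = Node("Project", "[key, mask(value)]",
              Node("GlobalLimit", "10", Node("LocalLimit", "10", scan)))
# Shape of the second optimized plan: limit above the masking Project.
after = push_project_below_limit(before)

print(render(plan_physical(before)))
print(render(plan_physical(after)))
```

In this model the first plan keeps its separate GlobalLimit and LocalLimit operators under the Project, while the rewritten plan collapses to CollectLimit over Project over the scan, mirroring the two physical plans shown above.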