  1. Apache Submarine
  2. SUBMARINE-638

Spark-security ranger plugin - Limit to be applied after masking projection


Details

    Description

      Consider a query with a limit, such as the one below, where the value column has to be masked:

      SELECT key, value FROM default.src LIMIT 10

      The plan then looks like this:

      == Parsed Logical Plan ==
      'GlobalLimit 10
      +- 'LocalLimit 10
         +- 'Project ['key, 'value]
            +- 'UnresolvedRelation `default`.`src`
      
      == Optimized Logical Plan ==
      Project [key#36, HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMaskShowLastN(value#37,4,x,x,x,-1,1) AS value#41]
      +- GlobalLimit 10
         +- LocalLimit 10
            +- SubmarineDataMasking
               +- HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#36, value#37]
      
      == Physical Plan ==
      Project [key#36, HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMaskShowLastN(value#37,4,x,x,x,-1,1) AS value#41]
      +- *(2) GlobalLimit 10
         +- Exchange SinglePartition
            +- *(1) LocalLimit 10
               +- *(1) HiveTableScan [key#36, value#37], HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.OpenCSVSerde, [key#36, value#37]
      

      The above plan reads all the files in the table, because the optimized logical plan places the Project above the limit. If the optimized logical plan instead applies the limit after the masking projection, the physical plan is converted to use CollectLimit, and the collect then reads only one file. The desired plan looks like this:

      == Parsed Logical Plan ==
      'GlobalLimit 10
      +- 'LocalLimit 10
         +- 'Project ['key, 'value]
            +- 'UnresolvedRelation `default`.`src`
      
      == Optimized Logical Plan ==
      GlobalLimit 10
      +- LocalLimit 10
         +- Project [key#36, HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMaskShowLastN(value#37,4,x,x,x,-1,1) AS value#41]
            +- SubmarineDataMasking
               +- HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#36, value#37]
      
      == Physical Plan ==
      CollectLimit 10
      +- Project [key#36, HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMaskShowLastN(value#37,4,x,x,x,-1,1) AS value#41]
         +- *(1) HiveTableScan [key#36, value#37], HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.OpenCSVSerde, [key#36, value#37]
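
The reordering described above can be illustrated with a small toy model. This is only a sketch, not Spark's Catalyst API and not the plugin's code: the `Relation`, `Project`, and `Limit` classes, the `limit_after_projection` function, and the `mask_show_last_n(value, 4)` column label are all hypothetical names chosen for illustration, and GlobalLimit/LocalLimit are collapsed into a single `Limit` node. It assumes the projection is row-wise and deterministic (as a masking expression would be), so it does not change the row count and can be swapped with the limit.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Relation:
    """Leaf node: a table scan (hypothetical stand-in for HiveTableRelation)."""
    name: str


@dataclass(frozen=True)
class Project:
    """Row-wise projection, e.g. the masking projection."""
    columns: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Limit:
    """Stand-in for the GlobalLimit/LocalLimit pair."""
    n: int
    child: "Plan"


Plan = Union[Relation, Project, Limit]


def limit_after_projection(plan: Plan) -> Plan:
    """Rewrite Project(Limit(x)) into Limit(Project(x)) throughout the tree."""
    if isinstance(plan, Project):
        child = limit_after_projection(plan.child)
        if isinstance(child, Limit):
            # Hoist the limit above the projection so it becomes the plan root.
            inner = Project(plan.columns, child.child)
            return Limit(child.n, limit_after_projection(inner))
        return Project(plan.columns, child)
    if isinstance(plan, Limit):
        return Limit(plan.n, limit_after_projection(plan.child))
    return plan


def render(plan: Plan, depth: int = 0) -> str:
    """Print the tree in a layout loosely resembling Spark's plan dumps."""
    prefix = "" if depth == 0 else "   " * (depth - 1) + "+- "
    if isinstance(plan, Relation):
        return f"{prefix}Relation {plan.name}"
    if isinstance(plan, Limit):
        head = f"{prefix}Limit {plan.n}"
    else:
        head = f"{prefix}Project [{', '.join(plan.columns)}]"
    return head + "\n" + render(plan.child, depth + 1)


# Shape reported in this issue: the masking projection sits above the limit.
before = Project(("key", "mask_show_last_n(value, 4)"),
                 Limit(10, Relation("default.src")))
after = limit_after_projection(before)

print(render(before))
print(render(after))
```

In the rewritten tree the limit is the root node, which is the shape the description expects to be planned as CollectLimit. A real fix would have to be expressed against Spark's own logical plan nodes and placed at the right point in the optimizer, which this sketch does not attempt.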

            People

              Assignee: Unassigned
              Reporter: Tenneti Venkata Sri Harsha (harsha249)