  1. Hive
  2. HIVE-27323 Iceberg: malformed manifest file or list can cause data breach
  3. HIVE-27927

Iceberg: Authorize location of Iceberg data reads to tables


Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.0-alpha-2
    • Fix Version/s: None
    • Component/s: Iceberg integration
    • Labels: None

    Description

      This is the second phase of the solution to prevent data breaches via Iceberg data file reads from custom locations, by authorizing the data locations a table may use.

       

      Before detailing one possible solution, let’s list which parts are deliberately not candidates for restriction, and why:

      • We need to keep allowing users to write Iceberg tables from Spark, aka to have direct file-system access. We can’t really restrict this type of access.
      • Users with direct access to their Iceberg table’s file-system location can put whatever they want into the table’s metadata/manifest files. We can’t really restrict these types of changes.
      • Users will have access to Iceberg Metadata Tables as part of Iceberg’s API; even without them, there would be other ways to obtain the information, or data file locations could simply be guessed. We can’t really restrict this information.

      The proposed solution has two components:

      1. Restrict the engines - Hive & Impala - from reading data files from locations that are not authorized for the table
      2. Restrict the users from extending the authorized data locations for the tables.

      1. Restricting the data-locations the engines are allowed to read from

      Somehow the engine needs to decide whether a data file location coming from the manifest file is valid and the data can be read from it, or it’s malicious and should be rejected. Remember that we can’t block users from creating malicious manifest files, so the engine itself must be able to allow/reject locations.

      To be able to decide which locations are allowed and which are to be rejected (aka fail / error-out the query), there must be some reference information.
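      The allow/reject decision itself can be sketched as a prefix check of each manifest entry against a list of authorized base locations. This is a minimal illustration only, assuming normalized path strings; the class and method names are hypothetical, not actual Hive code:

```java
import java.util.List;

public class DataLocationValidator {

    // Hypothetical sketch: reject any data file whose path is not under
    // one of the authorized base locations for the table.
    public static boolean isAllowedLocation(String dataFilePath, List<String> allowedBasePaths) {
        for (String base : allowedBasePaths) {
            // Ensure the base ends with a separator so that
            // "/warehouse/tbl1-evil" does not match base "/warehouse/tbl1".
            String prefix = base.endsWith("/") ? base : base + "/";
            if (dataFilePath.startsWith(prefix)) {
                return true;
            }
        }
        return false; // unknown location -> the engine fails the query
    }
}
```

      With the allowed list ["/warehouse/icebergtbl1/data"], a manifest entry pointing at "/tmp/secret/part-00000.parquet" would be rejected even though the manifest itself is structurally valid.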

      Right now the Iceberg specification declares these table properties:

      • write.data.path - Base location for data files (e.g. table location + /data)
      • write.metadata.path - Base location for metadata files (table location + /metadata)

      These are the bare minimum locations the engines “could” allow reading data from, but there are two problems with these properties:

      • First, they reflect only the current state of the table: changing them affects only new snapshots, while older data files can still be located in previous data locations and are still validly referenced by manifest files.
      • Second, these properties live in the Iceberg table’s metadata, which a user with write access to the file-system can manipulate the same way as the manifest file.

       

      Based on these problems, the requirements for the reference information the engines can compare data locations to are:

      • Allow multiple location definitions
      • Store them in a way where definitions/alterations can be authorized but still easily accessible by the engines

       

      Note: Why multiple locations and not a single one? To remain consistent with Iceberg’s API and provide its full feature-set; implementation-wise, iterating over an array of locations is not much harder than checking a single value.

       

      The proposed solution for this is to store the list of allowed locations for a given table in the HMS Catalog, optionally only in HMS (that way this would not need an Iceberg spec change), as part of the table definition. Anything stored in HMS can be authorized on both the CREATE and ALTER paths, the same way we currently authorize the “metadata_location” value.

      E.g. there could be a table property, like read.allowed.data.paths, stored in HMS; the engines would push its value down to execution and validate that no other locations are accessed.

      CREATE TABLE …
      STORED BY ICEBERG TBLPROPERTIES (
        'read.allowed.data.paths'='/some/old/loc/icebergtbl1,/new/loc/icebergtbl1' )
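      On the engine side, the comma-separated property value would need to be parsed into the list of authorized base locations before validation. A minimal sketch, assuming the read.allowed.data.paths property name proposed above (the class name and parsing details are illustrative, not actual Hive code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class AllowedPathsProperty {

    // Proposed property name from this ticket; not (yet) part of any spec.
    public static final String PROP = "read.allowed.data.paths";

    // Parse the comma-separated property value into the list of
    // authorized base locations the engine can validate against.
    public static List<String> parse(String propertyValue) {
        if (propertyValue == null || propertyValue.isBlank()) {
            return List.of();
        }
        return Arrays.stream(propertyValue.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }
}
```

      The value from the CREATE TABLE example above would parse into the two entries /some/old/loc/icebergtbl1 and /new/loc/icebergtbl1.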

       

      Note: to make onboarding users to Iceberg easier and faster, the property could inherit the default location of the data path on table creation (but only if it’s the ‘warehouse’-based default location!), similar to how the current metadata_location authorization is exempted when its location is based on the warehouse-specific default table location.
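      The exemption check in that note could be sketched as an exact comparison against the warehouse-based default data path. This is a hypothetical illustration assuming Hive’s conventional "<warehouse root>/<db>.db/<table>" managed-table layout plus Iceberg’s "/data" suffix; the class and method names are not actual Hive code:

```java
public class DefaultLocationCheck {

    // Hypothetical sketch: the authorization exemption would apply only
    // when the data path is exactly the warehouse-based default, i.e.
    // <warehouse root>/<db>.db/<table>/data (assumed layout).
    public static boolean isWarehouseDefaultDataPath(
            String warehouseRoot, String db, String table, String dataPath) {
        String root = warehouseRoot.endsWith("/") ? warehouseRoot : warehouseRoot + "/";
        String expected = root + db + ".db/" + table + "/data";
        return dataPath.equals(expected);
    }
}
```

      Any custom location (a relocated table, an external path) would fail this check and therefore still go through full authorization.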

      2. Restricting the data-locations the users are allowed to set

       

      With the above part of the proposal we can make sure that engines read the allowed list of locations from a property that can be properly authorized - from HMS. The second part is to secure what the tables can have as allowed data locations.

      Currently there is a similar Authorization when the Iceberg table is created/altered: authorizing what “metadata_location” path the user is allowed to create in / modify to.
      On Hive side this was fixed via HIVE-27322 and enhanced to reduce overhead for default locations via HIVE-27714.

      As there is already a “location” authorization, restricting users from defining malicious values for the new allowed list could reuse the same Authorizer, by adding each of the defined paths to the same Authorization request.

      If a table is to be relocated, or is created with a custom data location, the user needs a custom Ranger Policy authorizing such a configuration for the given table(s). This is the same as today, when a user wants to create/modify the “metadata_location”: it needs a custom policy. As the Ranger Policy used to authorize the location (storage-type / RWStorage) allows multiple locations within the same policy, a single policy can cover all possible locations for the specific table(s).
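      The “add each path to the same authorization request” step could look roughly like the following. The authorizer interface here is a stand-in Predicate, not the real Ranger/Hive authorizer API; it only illustrates the all-or-nothing semantics where one denied path fails the whole CREATE/ALTER:

```java
import java.util.List;
import java.util.function.Predicate;

public class AllowedPathsAuthorizer {

    // Hypothetical sketch: on CREATE/ALTER, every path in the proposed
    // read.allowed.data.paths list is checked against the same
    // storage-location authorizer (e.g. backed by a Ranger RWStorage
    // policy); a single denied path fails the whole DDL operation.
    public static void authorizeAll(List<String> paths, Predicate<String> locationAuthorizer) {
        for (String path : paths) {
            if (!locationAuthorizer.test(path)) {
                throw new SecurityException("Location not authorized for this table: " + path);
            }
        }
    }
}
```

      Because the real policy can list multiple locations, the predicate above would typically be satisfied by one policy covering all of the table’s past and present data paths.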


          People

            Assignee: Unassigned
            Reporter: Janos Kovacs
