ORC-14: Add column level encryption to ORC files

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      It would be useful to support column level encryption in ORC files. Since each column and its associated index is stored separately, encrypting a column separately isn't difficult. In terms of key distribution, it would make sense to use an external server like the one in HADOOP-9331.
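
      Since each column's streams are written independently, a writer can wrap only the selected columns in a cipher stream and leave the rest untouched. Below is a minimal, hypothetical sketch of that idea using the JCE; the maybeEncrypt method and its key handling are illustrative assumptions, not ORC's actual API.

      ```java
      import javax.crypto.Cipher;
      import javax.crypto.CipherOutputStream;
      import javax.crypto.spec.IvParameterSpec;
      import javax.crypto.spec.SecretKeySpec;
      import java.io.OutputStream;

      public class ColumnEncryptionSketch {
        // Hypothetical: because each column's data and index streams are
        // stored separately, encryption can be applied per column. Columns
        // without a key are written through unchanged.
        static OutputStream maybeEncrypt(OutputStream columnStream,
                                         SecretKeySpec columnKey,
                                         byte[] iv) throws Exception {
          if (columnKey == null) {
            return columnStream;               // unencrypted column
          }
          Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
          cipher.init(Cipher.ENCRYPT_MODE, columnKey, new IvParameterSpec(iv));
          return new CipherOutputStream(columnStream, cipher);
        }
      }
      ```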

          Activity

          Owen O'Malley added a comment - edited

          Ignore my previous comment.

          I am getting closer. I still need to write this up for the website, but the general direction is:

          • Add support for encrypting columns, where the writer adds two alternatives into the file:
            a. Encrypted original data
            b. Unencrypted masked data
          • The format change is backward compatible: old readers will get the unencrypted masked values.
          • It will use the Hadoop KMS by default, although it may be overridden.
          • Encryption will be AES (128- or 256-bit) in CTR mode, which allows seeks (see the CTR seek sketch after this list).
          • Different columns may use different master keys. Each writer will generate a random file id that is used to create a unique encryption key for the column in that file. To read an encrypted column, the user will need to have the KMS decrypt the column's encryption key (see the key-derivation sketch after this list).
          • The file and stripe statistics will be encrypted for the encrypted columns. However, the list of streams in the stripe footer will not be encrypted.
          • Masking of data may take several forms (see the redaction sketch after this list):
            a. Nullify - make all values null
            b. Redact - replace characters based on character class ('x' for letters, '9' for numbers, etc.)
            c. SHA256 - replace strings and numbers with the SHA-256 hash of the value
            d. Custom - user-defined method
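
          To make the "CTR mode allows seeks" point concrete, here is a minimal, hypothetical sketch (not ORC's actual reader code) of positioning a JCE AES/CTR cipher at an arbitrary byte offset. The counter block for any offset can be computed directly from the base IV, so nothing before the seek target has to be decrypted.

          ```java
          import javax.crypto.Cipher;
          import javax.crypto.spec.IvParameterSpec;
          import javax.crypto.spec.SecretKeySpec;
          import java.math.BigInteger;

          public class CtrSeekSketch {
            private static final int AES_BLOCK = 16;

            // Compute the counter (IV) for the AES block containing byteOffset,
            // treating the 16-byte base IV as a big-endian counter.
            static IvParameterSpec counterFor(byte[] baseIv, long byteOffset) {
              BigInteger counter = new BigInteger(1, baseIv)
                  .add(BigInteger.valueOf(byteOffset / AES_BLOCK));
              byte[] raw = counter.toByteArray();
              byte[] iv = new byte[AES_BLOCK];
              int n = Math.min(raw.length, AES_BLOCK);
              System.arraycopy(raw, raw.length - n, iv, AES_BLOCK - n, n);
              return new IvParameterSpec(iv);
            }

            // Decrypt bytes that start at byteOffset within the encrypted stream.
            static byte[] decryptAt(SecretKeySpec key, byte[] baseIv,
                                    long byteOffset, byte[] encrypted)
                throws Exception {
              Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
              cipher.init(Cipher.DECRYPT_MODE, key, counterFor(baseIv, byteOffset));
              // Burn the keystream for the partial block before byteOffset.
              int skip = (int) (byteOffset % AES_BLOCK);
              if (skip > 0) {
                cipher.update(new byte[skip]);
              }
              byte[] out = cipher.update(encrypted);
              return out != null ? out : new byte[0];
            }
          }
          ```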
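
          The key derivation is only described at a high level above; the HMAC construction below is one plausible way to combine a master key, the random file id, and the column name into a unique per-file column key. It is an assumption for illustration, not ORC's actual scheme (in practice the KMS would hold the master key and perform the decryption).

          ```java
          import javax.crypto.Mac;
          import javax.crypto.spec.SecretKeySpec;
          import java.nio.charset.StandardCharsets;

          public class ColumnKeySketch {
            // Hypothetical derivation: HMAC(master key, file id || column name),
            // truncated to 128 bits for an AES-128 column key.
            static SecretKeySpec deriveColumnKey(byte[] masterKey, byte[] fileId,
                                                 String columnName)
                throws Exception {
              Mac mac = Mac.getInstance("HmacSHA256");
              mac.init(new SecretKeySpec(masterKey, "HmacSHA256"));
              mac.update(fileId);
              mac.update(columnName.getBytes(StandardCharsets.UTF_8));
              byte[] derived = mac.doFinal();
              return new SecretKeySpec(derived, 0, 16, "AES");
            }
          }
          ```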
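
          Finally, a small sketch of the "redact" mask under the character-class rules listed above ('x' for letters, '9' for numbers); the method name is hypothetical.

          ```java
          public class RedactSketch {
            // Replace characters by class so readers without the key see the
            // shape of a value but not its content. Punctuation and whitespace
            // pass through unchanged.
            static String redact(String value) {
              StringBuilder masked = new StringBuilder(value.length());
              for (int i = 0; i < value.length(); i++) {
                char c = value.charAt(i);
                if (Character.isLetter(c)) {
                  masked.append('x');
                } else if (Character.isDigit(c)) {
                  masked.append('9');
                } else {
                  masked.append(c);
                }
              }
              return masked.toString();
            }
            // Example: redact("John Smith, SSN 123-45-6789")
            //          -> "xxxx xxxxx, xxx 999-99-9999"
          }
          ```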
          Owen O'Malley added a comment -

          I've started working on this. I'll post a patch this week.

          Larry McCay added a comment -

          CMF will be used to access keying material for column level encryption.

          Larry McCay added a comment -

          I am in the process of reworking the patch for HADOOP-9534 (Credential Management Framework) in order to support accessing keying material for this issue. Current thinking is that CMF can abstract the source of keys and be leveraged across a number of different crypto and password-protection use cases in the Hadoop ecosystem. This is why it is being done in Hadoop rather than Hive. We will also want to align its use with HADOOP-9331, since 9331 will be leveraged here as well as for the cryptoFS, etc.

          I will provide a description of the DDL/metastore and column-store changes that will be needed to support column level encryption once I have it written up.

          Andrew Purtell added a comment -

          > Yes if the code is available and provides the right API.

          Owen O'Malley: HADOOP-9331 proposes and provides an API, and HADOOP-9332 provides a codec implementing support for AES with optional hardware acceleration. This seems like an ideal use case for both. Should you have any proposed improvements to the API, please don’t hesitate to raise them on HADOOP-9331, where they will be promptly addressed. Likewise with the AES codec, please don’t hesitate to raise those on HADOOP-9332.

          Owen O'Malley added a comment -

          Andrew,
          Yes if the code is available and provides the right API.

          Andrew Purtell added a comment -

          So do you envision this as using the facilities provided by HADOOP-9331?

          Owen O'Malley added a comment -

          Supun,
          I've tagged this for Google Summer of Code. Take a look at:
          http://www.google-melange.com/gsoc/homepage/google/gsoc2013

          Supun Kamburugamuva added a comment -

          I'm a computer science graduate student at Indiana University, Bloomington, and my research area is distributed computing. I'm also a committer on a few Apache projects. I'm new to Hadoop and Hive, and I would like to learn about and contribute to these projects. It would be great if you could let me know the areas I should look at to get started.

          Regards,
          Supun Kamburugamuva


            People

            • Assignee: Owen O'Malley
            • Reporter: Owen O'Malley
            • Votes: 0
            • Watchers: 19
