Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: None
    • Labels: None

      Description

      As part of the encryption at rest story, RFile should support pluggable modules where it currently has hardcoded options for compression codecs. This is a natural place to add encryption capabilities, as the cost of encryption would likely not be significantly different from the cost of compression, and the block-level integration should maintain the same seek and scan performance. Given the many implementation options for both encryption and compression, it makes sense to have a plugin structure here.
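
As a rough illustration of the plugin seam (all names here are hypothetical, not an actual Accumulo API), both compression and encryption can be modeled as stream transformers that the file layer looks up by configured class name rather than a hardcoded codec switch:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical plugin seam: compression and encryption are both just
// stream transforms applied to a block of key/value data.
public interface BlockTransformer {
    // Wraps the raw block output stream (e.g. with a compressor or cipher).
    OutputStream wrapForWrite(OutputStream blockOut) throws IOException;

    // Wraps the raw block input stream for reading the block back.
    InputStream wrapForRead(InputStream blockIn) throws IOException;
}
{code}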

        Issue Links

          Activity

          afuchs Adam Fuchs added a comment -

          RFile is also used to dump the in-memory map to support isolation of long scans. See org.apache.accumulo.server.tabletserver.InMemoryMap.delete(long). We'll need to make sure the same RFile encryption configuration applies there.

          afuchs Adam Fuchs added a comment -

           Along with its internal uses, RFile is also used for bulk import. We should support asymmetric keys that allow external processes to encrypt RFiles such that only Accumulo processes can read them.
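
           A minimal sketch of how that could work with standard JCE key wrapping (assumptions: RSA for the asymmetric pair, AES for the per-file key; this is illustrative, not the actual design):

           {code:java}
           import javax.crypto.Cipher;
           import javax.crypto.KeyGenerator;
           import javax.crypto.SecretKey;
           import java.security.Key;
           import java.security.KeyPair;
           import java.security.KeyPairGenerator;

           public class BulkImportKeyWrap {
               public static void main(String[] args) throws Exception {
                   // Accumulo's key pair; only tablet servers hold the private half.
                   KeyPair accumulo = KeyPairGenerator.getInstance("RSA").generateKeyPair();

                   // External bulk-import writer: random per-file key, wrapped with
                   // the public key and stored in the file's crypto header.
                   SecretKey fileKey = KeyGenerator.getInstance("AES").generateKey();
                   Cipher wrap = Cipher.getInstance("RSA");
                   wrap.init(Cipher.WRAP_MODE, accumulo.getPublic());
                   byte[] wrappedFileKey = wrap.wrap(fileKey);

                   // Accumulo reader: only the private key can recover the file key.
                   Cipher unwrap = Cipher.getInstance("RSA");
                   unwrap.init(Cipher.UNWRAP_MODE, accumulo.getPrivate());
                   Key recovered = unwrap.unwrap(wrappedFileKey, "AES", Cipher.SECRET_KEY);
               }
           }
           {code}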

          supermallen Michael Allen added a comment -

           I've got a proposal for how to modify the RFile format so that it can easily support encryption. We won't change the types or availability of codecs, as previously suggested. Instead we will take our cue from the same encryption configuration used for the WALs and apply it slightly differently to RFiles.

          The proposal for changes is attached.

          kturner Keith Turner added a comment -

          Some comments on proposal V1.

           It seems like the IV would be transparent to RFile; it would just be encryption header information associated with a block, much like each gzip block has its own header. From RFile's perspective, it just needs to be able to read and write blocks of data, and when the encryption codec is not used there is simply no per-block IV. Does this sound correct? Taking this a step further, should encryption be pushed into BCFile? Currently RFile has no concept of compression; it just reads and writes blocks of data through BCFile. BCFile handles compression and stores compression metadata, such as which codec to use for reading. Even RFile's own root meta block is stored as a regular BCFile meta block and compressed like everything else. It seems like modifying BCFile rather than RFile may be easier. I already modified BCFile to support multi-level indexes in 1.4. BCFile was copied in because it was package-private, but it went unmodified for a long time.
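
           A sketch of that framing (purely illustrative, with AES/CBC as a stand-in cipher): the IV is written as opaque header bytes ahead of the encrypted block, so the layer above sees only "a block of data":

           {code:java}
           import javax.crypto.Cipher;
           import javax.crypto.CipherOutputStream;
           import javax.crypto.SecretKey;
           import javax.crypto.spec.IvParameterSpec;
           import java.io.IOException;
           import java.io.OutputStream;
           import java.security.GeneralSecurityException;
           import java.security.SecureRandom;

           class BlockEncryption {
               // Writes a fresh IV as the block's header, then returns a stream
               // that encrypts everything subsequently written to the block.
               static OutputStream startEncryptedBlock(OutputStream raw, SecretKey key)
                       throws IOException, GeneralSecurityException {
                   byte[] iv = new byte[16];
                   new SecureRandom().nextBytes(iv);
                   raw.write(iv); // just header bytes from the caller's perspective
                   Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
                   cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
                   return new CipherOutputStream(raw, cipher);
               }
           }
           {code}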

           Why is another interface needed? Why not use org.apache.hadoop.io.compress.CompressionCodec? I'm not saying we should or should not do this, but I would like to hear your thoughts since you have looked into it. I see some things in the design doc that I suspect influenced this decision, like the need to set a key and IV. While thinking about this I remembered that the BigTable paper mentioned using two compression codecs in series.

           In the past we have not supported rolling upgrades from 1.x to 1.(x+1), so we would only need to consider this if 1.6 supported it. Changes in the file format would be a small part of a larger effort to support rolling upgrades. Releases to date have always been able to read a file produced by any previous version, so Accumulo 1.4 can read RFiles produced by any previous version of Accumulo.

           Is there any concern with storing unencrypted blocks in memory? The code currently caches uncompressed blocks (still serialized with the RFile encoding) in memory. Would this be a concern if these cached blocks were swapped out? Would we want to keep blocks encrypted in the cache and decrypt only as needed?

          kturner Keith Turner added a comment -

           Some thoughts on storing blocks unencrypted in memory. The data has to be decrypted and stored in memory at some point to be read; not storing it decrypted in the cache just reduces the probability of that data swapping. I would think that anyone using encryption would configure swap appropriately. I am thinking we should not concern ourselves with swap, or with scrubbing all memory that ever held decrypted data. I suppose one other consideration with the cache is that the decrypted data could still be floating around there even after a table was deleted. This data would be available to anyone who could take a heap dump.

          supermallen Michael Allen added a comment -

          Hi Keith. Your comments are spot on in terms of modifying the BCFile vs. RFile classes. My prototype changes have all been to BCFile thus far, and I think they will be contained to that. When there's no encryption, then yes, there will be no IV.

          The current design I have in mind should apply equally to anything needing encryption-at-rest within Accumulo, including the WAL, RFiles, and WAL-Map files. Rather than making one encrypting implementation that fits within the compression codec interface, and another that is then used by the WAL file streaming, we have just one spot where one can easily (and generically based on configuration) obtain an enciphering output stream.
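
           Something like the following single seam, sketched here with hypothetical names (the real interface may differ):

           {code:java}
           import java.io.IOException;
           import java.io.InputStream;
           import java.io.OutputStream;
           import java.util.Map;

           // Hypothetical: the one place WAL, RFile, and map-file dump code all
           // go to obtain an enciphering/deciphering stream, driven entirely by
           // configuration.
           public interface CryptoStreamFactory {
               OutputStream encrypting(OutputStream out, Map<String, String> conf) throws IOException;
               InputStream decrypting(InputStream in, Map<String, String> conf) throws IOException;
           }
           {code}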

           We need to think more about the block caches and their encryption. In past crypto systems I've worked on (PGP), we generally only worried about very sensitive pieces of information like keys and passwords being swapped to disk, and not so much about things that were encrypted at rest being realized unencrypted in memory. Since we have a lot of control here, though (versus where PGP sat), we could consider re-encrypting the cached blocks with a key that only exists in memory for the lifetime of the process. That way, any swapped blocks would be stored on the file system encrypted to that key, and we don't have to do a lot of key management because the key is really only good for the lifetime of the process. There are some performance considerations as well, as cached reads will now require a decrypt.
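
           A minimal sketch of that process-lifetime key idea (assuming plain AES/CBC; a real implementation would also want integrity protection):

           {code:java}
           import javax.crypto.Cipher;
           import javax.crypto.KeyGenerator;
           import javax.crypto.SecretKey;
           import javax.crypto.spec.IvParameterSpec;
           import java.nio.ByteBuffer;
           import java.security.SecureRandom;

           class CacheKeyCrypto {
               // Generated once per process and never persisted; useless after a
               // restart, which is exactly the property we want for swapped pages.
               private final SecretKey processKey;

               CacheKeyCrypto() throws Exception {
                   KeyGenerator kg = KeyGenerator.getInstance("AES");
                   kg.init(128);
                   processKey = kg.generateKey();
               }

               // Re-encrypts a decrypted block before it enters the block cache.
               byte[] sealForCache(byte[] plainBlock) throws Exception {
                   byte[] iv = new byte[16];
                   new SecureRandom().nextBytes(iv);
                   Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
                   c.init(Cipher.ENCRYPT_MODE, processKey, new IvParameterSpec(iv));
                   byte[] ct = c.doFinal(plainBlock);
                   return ByteBuffer.allocate(iv.length + ct.length).put(iv).put(ct).array();
               }
           }
           {code}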

          kturner Keith Turner added a comment -

          and another that is then used by the WAL file streaming,

          Is there some reason the WAL encryption could not use a compression codec?
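
           For context, the operation is the same stream wrap either way; the codec question is really about which interface carries it. A hedged sketch of wrapping the WAL stream directly (AES/CFB chosen only because it needs no padding on a long-lived stream):

           {code:java}
           import javax.crypto.Cipher;
           import javax.crypto.CipherOutputStream;
           import javax.crypto.SecretKey;
           import javax.crypto.spec.IvParameterSpec;
           import java.io.DataOutputStream;
           import java.io.OutputStream;

           class WalEncryption {
               // Wraps the WAL output stream at the same seam a compression codec
               // would; encryption is just another stream transform here.
               static DataOutputStream wrap(OutputStream raw, SecretKey key, byte[] iv)
                       throws Exception {
                   Cipher c = Cipher.getInstance("AES/CFB/NoPadding");
                   c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
                   return new DataOutputStream(new CipherOutputStream(raw, c));
               }
           }
           {code}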

          kturner Keith Turner added a comment -

           Are you considering rekeying? I.e., encrypting new files' keys with a new key while reading old files' keys with an older key.

           Do you have any thoughts on key management? I was thinking that each tablet could have a key. The tablet's key is encrypted with the table key, and the tablet key is used to encrypt/decrypt file keys. A tablet never has to know the table key; it could ask a centralized service to decrypt its tablet key using the table key. Using this method, the table key would only need to be held in memory on one machine.
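
           Sketching that hierarchy (all names hypothetical; "AESWrap" is the standard JCE key-wrap transformation), the tablet server only ever sees the unwrapped tablet key, never the table key:

           {code:java}
           import javax.crypto.Cipher;
           import javax.crypto.SecretKey;
           import java.security.Key;

           class TabletKeyChain {
               // Hypothetical RPC facade for the centralized service that holds
               // the table key and decrypts tablet keys on request.
               interface TableKeyService {
                   SecretKey decryptTabletKey(byte[] wrappedTabletKey) throws Exception;
               }

               // file key <- tablet key <- table key (held only by the service)
               static Key unwrapFileKey(byte[] wrappedTabletKey, byte[] wrappedFileKey,
                                        TableKeyService service) throws Exception {
                   SecretKey tabletKey = service.decryptTabletKey(wrappedTabletKey);
                   Cipher c = Cipher.getInstance("AESWrap");
                   c.init(Cipher.UNWRAP_MODE, tabletKey);
                   return c.unwrap(wrappedFileKey, "AES", Cipher.SECRET_KEY);
               }
           }
           {code}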

          afuchs Adam Fuchs added a comment -

          RE rekeying and key management, I think that everybody who uses this will have a different perspective on key distribution, which is why it is important to have this be a pluggable system. It does make sense to do things like cycle keys over time, but building in a complex key tree in Accumulo doesn't really make a lot of sense. If you replace "tablet" with "file" in your comment, Keith, then I agree with you, and this can all be done within that plugin without complicating the Accumulo design.

          afuchs Adam Fuchs added a comment -

           RE encryption in memory, we might want to punt on this for now and wait for a future hardware solution that just encrypts everything in RAM. That would be independent of Accumulo, and there are a few of those on the horizon.

          kturner Keith Turner added a comment -

           I realize that the current proposal has a level of indirection for encryption at the file level. I was thinking the tablet may be another natural place for a level of indirection, but that was just thinking out loud. We should certainly support the ability for users to plug in their own key management. We should also explore offering some secure default key management in order to offer a complete feature for 1.6. That would be another ticket, though; I'll open it.

          kturner Keith Turner added a comment -

           Rekeying is a separate issue from key management. A tablet server may need to provide the capability to read with key X and write with key Y. This capability may or may not be used by a key management system, but it seems like the low level should support the operation.

          afuchs Adam Fuchs added a comment -

           If the plugin is given some information about which key to use and can look up keys or query a service, then rekeying should be implementable within the plugin. As long as we tie the session key used to encrypt the RFile to some other key and record that linkage in the file, the plugin should be able to look up old keys, or use a service to decrypt old keys, after the key is cycled. I believe this is in the current design, as part of the crypto info block.
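
           In other words, the crypto info block would carry roughly this linkage (field names illustrative, not the actual on-disk format):

           {code:java}
           // Illustrative shape of the per-file crypto info: enough for the
           // plugin to find the wrapping key again after a key cycle.
           class CryptoInfo {
               String wrappingKeyId;        // names the external key that wrapped the session key
               byte[] wrappedSessionKey;    // the per-file session key, encrypted to that key
               String cipherTransformation; // e.g. "AES/CBC/NoPadding", used for block data
           }
           {code}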

          supermallen Michael Allen added a comment -

           The design I'm seeing here is that each RFile will have its own encryption key embedded within its own crypto header block. That key (call it an RFile encryption key) will in turn be encrypted to a different key handled by an outside entity (this is where that SecretKeyEncryptionStrategy class comes into play). Should that external key become compromised, then presumably the only thing that needs to be rekeyed is the RFile encryption key; however, I don't think HDFS works that way, because one can only write a file once, correct? If you really wanted to, you could "rekey" the RFile encryption key by creating a new copy of the RFile with its encryption key decrypted with the old key and re-encrypted to the new key. The rest of the RFile would not have to be decrypted and re-encrypted, but the file would presumably have to be copied.
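
           A hedged guess at the shape of that hook (the real SecretKeyEncryptionStrategy interface may differ): the strategy owns both directions of the wrap, so key distribution stays entirely in the plugin:

           {code:java}
           // Hypothetical shape of a SecretKeyEncryptionStrategy-style plugin point.
           interface FileKeyWrapStrategy {
               // Wrap the per-file key; the returned opaque ID is stored alongside
               // it so the strategy can locate the right unwrap key later (e.g.
               // after a key cycle).
               Wrapped encryptSecretKey(byte[] plaintextFileKey) throws Exception;

               byte[] decryptSecretKey(Wrapped stored) throws Exception;

               class Wrapped {
                   byte[] wrappedKey;
                   String keyId;
               }
           }
           {code}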

           If this were a grave concern, then the way to solve it would be to make the crypto block a file external to the RFile rather than something embedded inside it. The only problems with that structure are the possibility of the RFile becoming separated from its companion crypto block, and the general untidiness that putting lots of little files everywhere causes. Since these files generally aren't dealt with directly by users or administrators, maybe this is the direction to go?

          kturner Keith Turner added a comment -

          I don't think HDFS works that way, because one can only write a file once, correct?

           Correct. If compactions can read and write data with different keys, then that provides a way to rekey. It sounds like your design may support this already? Granted, it's far more expensive than it needs to be, but I would be happy with the compaction rekeying capability at first, because at least the capability exists. The optimizations you mentioned for rekeying RFiles more efficiently could be something to consider later.

          supermallen Michael Allen added a comment -

          Yeah, my design will support reading with one key and writing with another. Actually, it will always do this, as each file has a separate key. So yes, you could force a compaction to force a rekey. Agreed about looking later at how to optimize this.

          Overall, then, Keith, does the proposal make sense? Seems workable?

          kturner Keith Turner added a comment -

          Actually, it will always do this, as each file has a separate key.

           I realize this, because each RFile has its own internal key encrypted with an external key. I was wondering about this external key: could you have different ones for reading and writing? We need to establish some firm terminology for the different classes of keys so the discussion stays clear. Maybe add something to the proposal naming and defining each key type; then those names can be referenced in discussion.

          Overall, then, Keith, does the proposal make sense? Seems workable?

           What you are proposing for RFile sounds good. It would be nice to outline how you are thinking of integrating this RFile capability into Accumulo; I suppose these are some of the questions I have been asking. When a major compaction happens, how does it get its keys? What does that interaction look like?

          billie.rinaldi Billie Rinaldi added a comment -

          Is there some reason the WAL encryption could not use a compression codec?

          Keith, could you expand on your suggestion that a compression codec could be used for WAL streaming encryption? The encryption would be much more reusable if we could do it all in a codec.

          What do people think about the encryption codec patch attached to HADOOP-9333? Could that be adapted for our purposes?

          vines John Vines added a comment -

          Resolved with patch under ACCUMULO-998


            People

            • Assignee: supermallen Michael Allen
            • Reporter: afuchs Adam Fuchs
            • Votes: 0
            • Watchers: 6
