HADOOP-10150 & HDFS-6134
Avik Dey, I’ve just looked at the MAR/21 proposal in HADOOP-10150. (The patches uploaded on MAR/21 do not apply cleanly on trunk, so I cannot look at them easily. They seem to have missing pieces, like getXAttrs() and the wiring to the KeyProvider API. Would it be possible to rebase them so they apply to trunk?)
do we need a new proposal for the work already being done on HADOOP-10150?
HADOOP-10150 aims to provide encryption for any filesystem implementation as a decorator filesystem, while HDFS-6134 aims to provide encryption for HDFS itself.
The two approaches differ in the level of transparency you get. The comparison table in the "HDFS Data at Rest Encryption" attachment (https://issues.apache.org/jira/secure/attachment/12635964/HDFSDataAtRestEncryption.pdf) highlights the differences.
In particular, the things I’m most concerned with are:
- All clients (doing encryption/decryption) must have access to the key management service.
- Secure key propagation to tasks running in the cluster (i.e., Mapper and Reducer tasks).
- Use of AES-CTR (instead of an authenticated encryption mode such as AES-GCM).
- It is not clear how hflush() will be handled.
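On the AES-CTR point: CTR gives confidentiality only, while an authenticated mode such as GCM also detects ciphertext tampering. A minimal JCE sketch of the difference (toy all-zero key/IV for illustration only, not code from either proposal):

```java
import java.util.Arrays;
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CtrVsGcm {
  public static void main(String[] args) throws Exception {
    SecretKeySpec key = new SecretKeySpec(new byte[16], "AES"); // toy key
    byte[] data = "sensitive record".getBytes("UTF-8");

    // AES-CTR: ciphertext is malleable; a flipped bit decrypts without any
    // error into silently corrupted plaintext.
    Cipher ctr = Cipher.getInstance("AES/CTR/NoPadding");
    ctr.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(new byte[16]));
    byte[] ct = ctr.doFinal(data);
    ct[0] ^= 0x01; // tamper with one bit
    ctr.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(new byte[16]));
    byte[] pt = ctr.doFinal(ct); // no exception
    System.out.println("CTR: tampered ciphertext decrypted without error,"
        + " plaintext corrupted: " + !Arrays.equals(pt, data));

    // AES-GCM: the ciphertext carries a 128-bit authentication tag, so the
    // same bit flip makes doFinal() throw AEADBadTagException.
    Cipher gcm = Cipher.getInstance("AES/GCM/NoPadding");
    gcm.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, new byte[12]));
    byte[] gct = gcm.doFinal(data);
    gct[0] ^= 0x01; // tamper with one bit
    gcm.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, new byte[12]));
    try {
      gcm.doFinal(gct);
      System.out.println("GCM missed the tampering");
    } catch (AEADBadTagException e) {
      System.out.println("GCM rejected tampered ciphertext");
    }
  }
}
```

With CTR alone, nothing in the decryption path tells the application that stored data was modified; that check would have to come from somewhere else (e.g., HDFS block checksums, which are not cryptographic).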
are there design choices in this proposal that are superior to the patch already provided on HADOOP-10150?
IMO, consolidated access to and distribution of keys by the NN (as opposed to by every client) improves the security of the system.
do you have additional requirements listed in this JIRA that could be incorporated into HADOOP-10150?
They are enumerated in the "HDFS Data at Rest Encryption" attachment. The ones I don’t see addressed in HADOOP-10150 are #6 and #8.A. And it is not clear how #4 and #5 can be achieved.
so we can collaborate and not duplicate?
Definitely, I want to work together with you guys to leverage as much as possible, either by unifying the two proposals or by sharing common code if we think both approaches have merit and we decide to move forward with both.
Happy to jump on a call to discuss things and then report back to the community if you think that will speed up the discussion.
By looking at the latest design doc in HADOOP-10150 I can see that things have been modified a bit (from the original design doc), bringing it a bit closer to some of the HDFS-6134 requirements.
Still, it is not clear how transparency will be achieved for existing applications: the HDFS URI changes, clients must connect to the key store to retrieve the encryption key (so clients will need key store principals), and the encryption key must be propagated to job tasks (i.e., Mapper/Reducer processes).
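On the key-propagation point, one possible avenue (a sketch only, not something either JIRA currently implements) is MapReduce's existing Credentials mechanism, which already ships secrets to tasks alongside delegation tokens. The alias string and the fetchKeyFromKeyStore() helper below are hypothetical; Credentials, Job#getCredentials() and JobContext#getCredentials() are real Hadoop APIs. The snippet requires a Hadoop classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class KeyShipping {
  // Hypothetical alias under which the data key travels with the job.
  private static final Text KEY_ALIAS = new Text("hdfs.encryption.data-key");

  // Submission side: stash the data key in the job credentials, which the
  // framework distributes to tasks over its existing secure channel.
  static Job submit(Configuration conf, byte[] dataKey) throws Exception {
    Job job = Job.getInstance(conf, "read-encrypted-data");
    job.getCredentials().addSecretKey(KEY_ALIAS, dataKey);
    // ... configure input/output/mapper and submit ...
    return job;
  }

  // Task side: recover the key from the task's credentials in setup().
  public static class DecryptingMapper
      extends Mapper<Object, Object, Object, Object> {
    private byte[] dataKey;

    @Override
    protected void setup(Context context) {
      dataKey = context.getCredentials().getSecretKey(KEY_ALIAS);
    }
  }
}
```

This would avoid every task talking to the key store directly, but it still leaves open who fetches the key at submission time and under what principal.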
Requirement #4: "Can decorate HDFS and all other file systems in Hadoop, and will not modify existing structure of file system, such as namenode and datanode structure if the wrapped file system is HDFS." This is contradicted by the design; in the "Storage of IV and data key" section it is stated: "So we implement extended information based on INode feature, and use it to store data key and IV."
Requirement #5: "Admin can configure encryption policies, such as which directory will be encrypted." This seems driven by the HDFS client configuration file (hdfs-site.xml). That is not really admin-driven, as clients could break it by editing their own hdfs-site.xml.
Restrictions on move operations for files within an encrypted directory: the original design had something about this (not entirely correct); now it is gone.
(As mentioned before) how will hflush() operations be handled, given that they cut the encryption block short? How is this handled on writes? How is this handled on reads?
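For what it’s worth, CTR does make mid-stream flushes and positional reads tractable, because the keystream position is a pure function of the byte offset; the open question is whether the proposal does this bookkeeping. A self-contained JCE sketch (toy all-zero key/IV; counterForOffset() is my own helper, not code from either JIRA) of writing with a mid-block flush and then decrypting from an arbitrary offset:

```java
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CtrSeek {
  static final int AES_BLOCK = 16;

  // Counter block for a byte offset: initial IV plus (offset / 16), added
  // big-endian with carries. A hypothetical helper for illustration.
  static byte[] counterForOffset(byte[] iv, long offset) {
    byte[] ctr = iv.clone();
    long add = offset / AES_BLOCK;
    for (int i = ctr.length - 1; i >= 0 && add != 0; i--) {
      long sum = (ctr[i] & 0xFF) + (add & 0xFF);
      ctr[i] = (byte) sum;
      add = (add >>> 8) + (sum >>> 8); // propagate the carry
    }
    return ctr;
  }

  public static void main(String[] args) throws Exception {
    SecretKeySpec key = new SecretKeySpec(new byte[16], "AES"); // toy key
    byte[] iv = new byte[16];                                   // toy IV
    byte[] plain = new byte[100];
    for (int i = 0; i < plain.length; i++) plain[i] = (byte) i;

    // Writer flushes mid-block: 50 bytes now, 50 bytes later. The Cipher
    // object keeps the keystream position, so one open stream is unaffected.
    Cipher enc = Cipher.getInstance("AES/CTR/NoPadding");
    enc.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
    byte[] c1 = enc.update(Arrays.copyOfRange(plain, 0, 50));
    if (c1 == null) c1 = new byte[0];
    byte[] c2 = enc.doFinal(Arrays.copyOfRange(plain, 50, 100));
    byte[] ct = new byte[c1.length + c2.length];
    System.arraycopy(c1, 0, ct, 0, c1.length);
    System.arraycopy(c2, 0, ct, c1.length, c2.length);

    // Reader seeks to byte 50: jump the counter to block 50/16 = 3, then
    // discard 50 % 16 = 2 leading keystream bytes to land mid-block.
    int pos = 50, skip = pos % AES_BLOCK;
    Cipher dec = Cipher.getInstance("AES/CTR/NoPadding");
    dec.init(Cipher.DECRYPT_MODE, key,
        new IvParameterSpec(counterForOffset(iv, pos)));
    byte[] in = new byte[skip + (ct.length - pos)]; // zero-padded front
    System.arraycopy(ct, pos, in, skip, ct.length - pos);
    byte[] out = dec.doFinal(in);
    byte[] tail = Arrays.copyOfRange(out, skip, out.length);
    System.out.println("seek-decrypt matches: "
        + Arrays.equals(tail, Arrays.copyOfRange(plain, pos, plain.length)));
  }
}
```

So the mechanics are workable; the design doc should spell out that a reopened or seeking reader recomputes the counter from the file offset, since a hflush() boundary almost never falls on a 16-byte block boundary.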
Explicit auditing of access to encrypted files does not seem to be handled.