Hadoop Common / HADOOP-10919

Copy command should preserve raw.* namespace extended attributes

    Details

      Description

      Refer to the doc attached to HDFS-6509 for background.

      Like distcp -p (see MAPREDUCE-6007), the copy command also needs to preserve extended attributes in the raw.* namespace by default whenever the src and target are in /.reserved/raw. To not preserve raw xattrs, don't specify /.reserved/raw in either the src or target.
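The rule in the description can be sketched as a small predicate. This is only an illustration, not the patch itself: the helper names `is_raw_path` and `preserve_raw_xattrs` are hypothetical, and the sketch assumes plain absolute paths without a URI scheme.

```python
from pathlib import PurePosixPath

# Reserved prefix named in the issue description.
RAW_PREFIX = PurePosixPath("/.reserved/raw")


def is_raw_path(path: str) -> bool:
    """Return True if path is /.reserved/raw itself or lies beneath it."""
    p = PurePosixPath(path)
    return p == RAW_PREFIX or RAW_PREFIX in p.parents


def preserve_raw_xattrs(src: str, dst: str) -> bool:
    """Hypothetical decision: keep raw.* xattrs only when both ends are raw paths."""
    return is_raw_path(src) and is_raw_path(dst)
```

Under this sketch, dropping the prefix from either side is enough to turn preservation off, which matches the opt-out described above.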

      1. HDFS-6134-Distcp-cp-UseCasesTable2.pdf
        52 kB
        Sanjay Radia
      2. HDFS-6134-Distcp-cp-UseCasesTable.pdf
        50 kB
        Sanjay Radia
      3. HADOOP-10919.002.patch
        19 kB
        Charles Lamb
      4. HADOOP-10919.001.patch
        15 kB
        Charles Lamb

          Activity

          Sanjay Radia added a comment -

          I misunderstood the EZKey. Matching does not matter for distcp/cp. I have updated the use cases table.

          Sanjay Radia added a comment -

          I have attached a table that shows the distcp/cp use cases and the desirable outcomes. I think this is implementable in a transparent fashion within distcp or cp using the /r/r mechanism.

          Andrew Wang added a comment -

          Note that if you copy from at or above the EZ root, it'll preserve the EZ root's raw xattrs and thus create the EZ. We have a special hook in FSDirectory#unprotectedSetXAttrs that watches for the special EZ xattr being set. If you're copying from below the EZ root, then only that subtree is preserved. We don't automatically create an EZ above the distcp dst (which would be kind of weird behavior).
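The "at or above the EZ root" condition above can be expressed as a path check. This is a rough sketch under the assumption that the copy uses /.reserved/raw on both sides; `copy_recreates_ez` is a hypothetical name and does not model the FSDirectory hook itself.

```python
from pathlib import PurePosixPath


def copy_recreates_ez(copy_root: str, ez_root: str) -> bool:
    """Return True if the copied subtree contains the EZ root directory itself.

    Only then are the EZ root's raw xattrs part of the copy, so only then
    would setting them on the destination recreate the zone.
    """
    copy_p = PurePosixPath(copy_root)
    ez_p = PurePosixPath(ez_root)
    return copy_p == ez_p or copy_p in ez_p.parents
```

A copy rooted below the EZ root returns False here, which corresponds to the case where only the subtree is preserved and no zone is created above the destination.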

          Charles Lamb added a comment -

          Q. when you say "distcp /r/r/src /r/r/dest" are the keys read from src and preserved in the dest? Does the act of copying the keys from a /r/r/src into a /r/r/dest automatically set up a matching EZ in the destination?

          Yes to the first question and no to the second. Copying the keys occurs and that is almost good enough to set up a matching EZ. However, what doesn't happen is a call to createEncryptionZone so there is not an actual EZ in place on the dst. The admin is expected to have done that before the distcp. If the admin wants a parallel EZ (i.e. with the same keys, ez-key, etc.) – and presumably they do because they're copying from /.r/r to /.r/r and preserving the keys along the way (this is my case "(a)" above) – then it is also expected that if the dest NN is not the same as the src (likely) that the NN and the clients accessing that NN will have equal access to the KMS (presumably the same KMS is shared across src and dst).

          Does this make sense?

          Sanjay Radia added a comment -

          Q. when you say "distcp /r/r/src /r/r/dest" are the keys read from src and preserved in the dest? Does the act of copying the keys from a /r/r/src into a /r/r/dest automatically set up a matching EZ in the destination?

          Charles Lamb added a comment -

          I'll update the HDFS-6509 doc to reflect the bit about trashing.

          1. src subtree and dst subtree do not have EZ - easy, same as today

          Agreed.

          2. src subtree has no EZ but dest does have EZ in a portion of its subtree. Possible outcomes
          1. if user performing operation has permissions in dest EZ then the files within the dest EZ subtree are encrypted

          Agreed.

          2. src subtree has no EZ but dest does have EZ in a portion of its subtree. Possible outcomes
          ...
          2. if user does not (say Admin) what do we expect to happen?

          The behavior should be the same as what happens today: user (the admin) gets a permission violation because the admin does not have access to the target.

          3. src subtree has EZ but dest does not. Possible outcomes
          1. files copied as encrypted but cannot be decrypted at the dest since it does not have an EZ - useful as a backup

          /.r/r: raw files are copied to dest so dest contains encrypted (and unreadable) files
          !/.r/r: files are decrypted by distcp and copied to dst (decrypted). Files are readable because they have been decrypted during the copy.

          3. src subtree has EZ but dest does not. Possible outcomes
          ...
          2. files copied as encrypted and a matching EZ is created automatically. Can an admin do this operation since he does not have access to the keys?

          I don't think that distcp can, or should, create a matching EZ automatically. It is too hard for it to know what the intent of the copy is. Should the new ez have the same ez-key as the src ez or a different one? Sure, we could have an option to let the user specify that, but for the first crack I wanted to keep it fairly simple. So, the theory is that the admin creates the empty EZ before performing the distcp. The admin can either set up the EZ with the same ez-key as the src ez (call this "(a)" below), or the dest can have a different ez-key than the src (call this "(b)" below). After the ez is created, distcp will try to maintain the files as encrypted. In either of those scenarios, there are a couple of cases:

          distcp with /.r/r: (a) works ok because the EDEKs for each file are copied from src to dst. (b) does not work because when the files are opened in the dest hierarchy, the EDEKs will be decrypted with the new ez-key(dst), and that fails because they were encrypted with ez-key(src). This could be made to work by having the KMS decrypt the EDEKs and re-encrypt them with the new ez-key(dst), but it would assume that the distcp invoker had proper credentials with the KMS for the keys. So in general this scenario is only useful when the src-ez and the dst-ez have been set up with the same ez-key. There are other issues with this that are discussed under HDFS-6134, such as different key lengths, etc.

          distcp with no /.r/r: Both of (a) and (b) work ok as long as the invoker has access to the files that are being copied. distcp decrypts the files on read and they get re-encrypted on write. This is pretty much the same as today.
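The (a)/(b) cases for a pre-created destination EZ can be summarized in a few lines. This is a sketch of the reasoning only; the function and parameter names are hypothetical and it does not model the KMS re-encryption option mentioned above.

```python
def dest_files_readable(uses_raw_paths: bool,
                        same_ez_key: bool,
                        invoker_can_decrypt: bool = True) -> bool:
    """Return True if files copied into a pre-created dest EZ can be read there."""
    if uses_raw_paths:
        # EDEKs are copied verbatim from src, so they can only be
        # decrypted at dest if the dest EZ uses the same ez-key: case (a).
        return same_ez_key
    # Without /.r/r the files are decrypted on read and re-encrypted on
    # write, so both (a) and (b) work as long as the invoker has access.
    return invoker_can_decrypt
```

For example, `dest_files_readable(uses_raw_paths=True, same_ez_key=False)` corresponds to case (b) with /.r/r, the one combination described above as not working.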

          3. src subtree has EZ but dest does not. Possible outcomes
          ...
          3. throw an error which can be overridden by a flag, in which case the files are decrypted and left decrypted in dest. This only works if the user has permissions for decryption; admin cannot do this.

          /.r/r: The files aren't decrypted so this scenario is perfectly acceptable.

          !/.r/r: As you say, the admin can't do this because they presumably don't have access to the files on the src (and probably not on the target either). So this scenario is about some random user doing a distcp of some subset of the tree on their own. I think that what you're suggesting is a way of trying to keep the user from shooting themselves in the foot by ensuring that they don't leave unencrypted data hanging around in the dest. I can see this both ways. On the one hand, someone has given the user access to the files and keys. They are expected to "do the right thing" with the decrypted file contents, including not putting it somewhere "unsafe". It is "transparent encryption" after all. And they might actually want to leave it hanging around in unencrypted form because (e.g.) maybe dst is on a cluster inside a SCIF and it's ok to leave the files unencrypted.

          But I think I like your suggestion that we throw an exception in this case (user not using /.r/r, any of the source paths are in an ez, dest is not in an ez) unless a flag is set.
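The proposed guard could look roughly like the following. This is a sketch of the suggestion as stated in this comment, not of anything that was committed; the exception class, function name, and `allow_decrypted_dest` flag are all hypothetical.

```python
from typing import Iterable


class UnencryptedDestinationError(Exception):
    """Raised when encrypted source data would land unencrypted at dest."""


def check_copy(src_paths_in_ez: Iterable[bool],
               dst_in_ez: bool,
               uses_raw_paths: bool,
               allow_decrypted_dest: bool = False) -> None:
    """Refuse a copy that would silently leave decrypted data outside an EZ."""
    if uses_raw_paths:
        # With /.r/r the files are never decrypted, so nothing to guard.
        return
    if any(src_paths_in_ez) and not dst_in_ez and not allow_decrypted_dest:
        raise UnencryptedDestinationError(
            "source is in an encryption zone but destination is not; "
            "set the override flag to copy anyway"
        )
```

The override flag keeps the case where leaving data unencrypted is intentional, such as the SCIF example above.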

          4. both src and dest have EZ at exactly the same part of the subtree. Possible outcomes
          1. If user has permission to decrypt and encrypt, then the data is copied and encryption is redone with new keys,
          2. If user does not have permission then ?? Fail or copy as raw?

          I think we should just treat this the same as current behavior. We just attempt it and if it works great, and if not, we throw the exception. So I don't think there's anything unusual here.

          5. both src and dest have EZ at different parts of the subtree. This should reduce to 2 or 3.

          Agreed.

          Sanjay Radia added a comment -

          trashing .... It's assumed that an hdfs admin would not (intentionally) do that.

          Okay, please add that to your doc when you next update it. We could allow just read access to /r/r/ to all.

          Use cases: Charles, can we please work together to get the distcp use cases nailed down? We can work offline to go faster and then summarize for the community.

          Charles Lamb added a comment -

          Hi Sanjay,

          The trashing would be due to non-admin users having access to the raw.* xattrs via /.r/r. If they were able to corrupt the xattrs, then that would effectively trash the file. It's assumed that an hdfs admin would not (intentionally) do that.

          Sanjay Radia added a comment -

          Charles, let's enumerate the distcp use cases. Here is my first draft. For some of the use cases below I propose possible desirable outcomes, but these outcomes can be debated separately from the use cases.

          1. src subtree and dst subtree do not have EZ - easy, same as today
          2. src subtree has no EZ but dest does have EZ in a portion of its subtree. Possible outcomes
            1. if user performing operation has permissions in dest EZ then the files within the dest EZ subtree are encrypted
            2. if user does not (say Admin) what do we expect to happen?
          3. src subtree has EZ but dest does not. Possible outcomes
            1. files copied as encrypted but cannot be decrypted at the dest since it does not have an EZ - useful as a backup
            2. files copied as encrypted and a matching EZ is created automatically. Can an admin do this operation since he does not have access to the keys?
            3. throw an error which can be overridden by a flag, in which case the files are decrypted and left decrypted in dest. This only works if the user has permissions for decryption; admin cannot do this.
          4. both src and dest have EZ at exactly the same part of the subtree. Possible outcomes
            1. If user has permission to decrypt and encrypt, then the data is copied and encryption is redone with new keys,
            2. If user does not have permission then ?? Fail or copy as raw?
          5. both src and dest have EZ at different parts of the subtree. This should reduce to 2 or 3.

          For each of the above, one can have distcp do the right thing automatically, or we can force the user to explicitly submit /r/r/path as appropriate. Let's explore both approaches and see which one works better.
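The five use cases can be told apart mechanically from where the EZ roots sit in each subtree. This is an illustrative sketch only: `classify_use_case` is a hypothetical helper, and it assumes EZ roots are given as paths relative to the copied subtree root so that src and dest can be compared.

```python
from typing import Set


def classify_use_case(src_ez_roots: Set[str], dst_ez_roots: Set[str]) -> int:
    """Map EZ placement in the src and dest subtrees to use case 1-5."""
    if not src_ez_roots and not dst_ez_roots:
        return 1  # no EZ on either side: same as today
    if not src_ez_roots:
        return 2  # only dest has an EZ
    if not dst_ez_roots:
        return 3  # only src has an EZ
    if src_ez_roots == dst_ez_roots:
        return 4  # EZs at exactly the same part of the subtree
    return 5      # EZs at different parts; reduces to case 2 or 3
```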

          Sanjay Radia added a comment -

          Charles can you expand on what trashing you are worried about? One only needs read access on the src side.

          Charles Lamb added a comment -

          Hi Sanjay Radia,

          Is this mentioned in the distcp doc and I missed it?

          Yes, third para of the second page: "Only HDFS admins have access to the raw hierarchy as this will prevent regular users from trashing files in an EZ."

          Sanjay Radia added a comment -

          Right now it's transparent in that distcp will decrypt when it reads from the normal path. This is what all existing distcp scripts will be doing, copying to and from normal paths. ... but it's a reasonable and sometimes desirable behavior.

          At the meeting and in the jira we concluded that the above behavior is not desirable. First, the user running the distcp may not have permission to decrypt (e.g. an Admin at NSA). Second, the data is being transmitted in the clear. Third, there is the efficiency argument. You are saying "but it's a reasonable and sometimes desirable behavior." I thought we had established that it is not, and hence we are doing /.r/r so that distcp can take advantage of it. I hope you still want to do /.r/r? Maybe you are asserting that /.r/r was unnecessary but you are willing to do it to please a few in the community. That's okay - we can agree to disagree here.

          I would have thought that if distcp prefixes all paths with /.r/r then it would just work. Your comment says that "/.r/r is also superuser only" – not sure what you mean - only the superuser can access /.r/r? Surely that is not the case? Is this mentioned in the distcp doc and I missed it?

          Andrew Wang added a comment -

          Hi Sanjay,

          Could we define the requirements for "transparent"? Right now it's transparent in that distcp will decrypt when it reads from the normal path. This is what all existing distcp scripts will be doing, copying to and from normal paths. It's less efficient since it involves decryption, and results in different bytes-on-disk on the destination (either because it's unencrypted, or it's given a different EDEK), but it's a reasonable and sometimes desirable behavior. Using the /.reserved/raw paths is a way of doing a direct byte-to-byte identical copy, which is also a sometimes desirable behavior.

          It sounds like you want the direct byte-to-byte copy to be the default, but remember that it's an API with sharp edges, many of which are laid out in the doc. /.r/r is also superuser only, since it lets you muck directly with the raw xattrs. This means we can't transparently add the /.r/r prefix if the distcp runs as a normal user. Because of all this, we decided to implement the current, safer behavior.

          Does this sound reasonable?

          Sanjay Radia added a comment -

          Given that, I'm wondering what would the purpose be for checking that the target is an EZ?

          You mentioned that in your doc and hence I raised it here.

          Given that your document mentioned that the target and src must match with respect to EZ, I thought that you had made distcp transparent: i.e., distcp will check if any dir in the subtree is an EZ and will prefix the paths with /.reserved/raw. And I think that is a good idea since it will mean that all existing distcp scripts will continue to work if you set the EZ on the src and target correctly.

          Charles Lamb added a comment -

          I should clarify case (1). If you are distcp'ing from the ez root or higher, then you don't need to pre-create the EZ because all of the raw.* xattrs will be preserved.

          Given that, I'm wondering what would the purpose be for checking that the target is an EZ?

          Charles Lamb added a comment -

          Sanjay,

          There are three scenarios.

          (1) An administrator who does not have access to the keys in the KMS would use the /.reserved/raw prefix on src and dest:

          distcp /.reserved/raw/src /.reserved/raw/dest

          The /.reserved/raw is the only interface that exposes the raw.* xattrs holding the encryption metadata. This allows the raw.* xattrs to be preserved on the dest as well as to copy the files without decrypting them. This scenario assumes that an ez has been set up on dest. As you suggested, it would be a good idea to check that the dest is actually an ez.

          (2) A non-admin user who has access to some subset of files in an ez could use the non-/.reserved/raw prefix and copy a hierarchy from one ez to another. In that case, the raw.* xattrs from the src ez would not be preserved. This scenario assumes that the dest ez is already set up. Of course the dest files will have new keys associated with them since they'll be new copies.

          (3) Neither src nor dst has /.reserved/raw and one or the other of src/dest is not an ez. It is not necessary to have the target also be an ez. The use case would be that the user wants to copy a subset of the ez into/out-of a non-encrypted file system. distcp without the /.reserved/raw prefix could be used for this.

          Does this all make sense?

          Sanjay Radia added a comment -

          Charles, what is the usage model for distcp of encrypted files:

          • distcp path1 path2 - where distcp will insert /.reserved/raw into the pathnames if in an encryption zone.
          • OR distcp /.reserved/raw/path1 /.reserved/raw/path2

          BTW, is the proposal that both src and dest MUST be encryption zones, or neither? (Because of your "misspoke" comment I am a little confused.)

          Charles Lamb added a comment -

          Sanjay,

          I just re-read your comment and I realized that I mis-spoke.

          Yes, I think it would make sense. I'll open a jira for that.

          Thanks.

          Charles Lamb added a comment -

          "Charles, you list a disadvantage for the .raw scheme where the target of a distcp is not an encrypted zone. Would it make sense for distcp to check for that and to fail the distcp?"

          Hi Sanjay,

          Presently distcp requires that src and target either both be in /.reserved/raw or both be outside it.

          I'll update the HDFS-6509 document and comments.

          Thanks for catching that.

          Sanjay Radia added a comment -

          Charles, the work you did for distcp also needs to be applied to har. I suspect .raw would work there as well.

          Sanjay Radia added a comment -

          Charles, you list a disadvantage for the .raw scheme where the target of a distcp is not an encrypted zone. Would it make sense for distcp to check for that and to fail the distcp?

          Charles Lamb added a comment -

          Thanks for the review Andrew Wang. I've committed this to the fs-encryption branch.

          Andrew Wang added a comment -

          +1 LGTM thanks charles

          Charles Lamb added a comment -

          Thanks for the review Andrew Wang. I believe the .002 patch addresses your comments above.

          Andrew Wang added a comment -

          This looks basically right, just a few review comments:

          • Would be nice to quote paths in exception messages for clarity
          • Could mention that checkPathsForReservedRaw expects fully-qualified paths (i.e. not relative)

          Test:

          • testCopyCommandsWithRawXAttrs: setting the xattrs looks like it could be turned into two loops. The same xattr names and values are also copy-pasted into checkXAttrs; it seems like we could dedupe this. There's also a double semicolon.
          • I'm not a huge fan of the per-parameter inline comments; it's not something I've seen in Hadoop before. IDEs help you figure this out without a comment.
          • checkXAttrs: it would be better to use assertEquals rather than assertTrue for the size. All the asserts should have error messages too.
          • I would feel a bit better if we also tested a relative destination containing a "..", though I'm fairly sure that it works.

          Thanks Charles!

          Charles Lamb added a comment -

          This patch more closely matches the proposed behavior of distcp wrt raw xattrs:

          . There is no -pd option.
          . Whether to preserve raw.* xattrs is determined solely by whether all of the source and target pathnames are in the /.reserved/raw hierarchy.
          . If the source and target paths do not all agree on whether they are under /.reserved/raw (/.r/r), an exception is thrown.
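
          The rule in the list above can be illustrated with a small, self-contained Java sketch. This is not the code from the patch: the class RawPathPolicy and the methods isReservedRaw and shouldPreserveRawXattrs are hypothetical names chosen for illustration, paths are handled as plain strings rather than Hadoop Path objects, and inputs are assumed to be already fully qualified (absolute, with no ".." components).

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: hypothetical helper, not part of Hadoop. */
public final class RawPathPolicy {
    private static final String RESERVED_RAW = "/.reserved/raw";

    private RawPathPolicy() {}

    /** True if the absolute path is /.reserved/raw itself or lies beneath it. */
    public static boolean isReservedRaw(String absolutePath) {
        return absolutePath.equals(RESERVED_RAW)
            || absolutePath.startsWith(RESERVED_RAW + "/");
    }

    /**
     * Decides whether raw.* xattrs should be preserved for a copy.
     * Returns true when every source and the target are under /.reserved/raw,
     * false when none of them are, and throws when they disagree.
     * Assumption: with an empty source list the target alone decides.
     */
    public static boolean shouldPreserveRawXattrs(List<String> sources, String target) {
        boolean targetIsRaw = isReservedRaw(target);
        for (String src : sources) {
            if (isReservedRaw(src) != targetIsRaw) {
                // Paths are quoted in the message for readability.
                throw new IllegalArgumentException(
                    "Source '" + src + "' and target '" + target
                    + "' must either both be in " + RESERVED_RAW + " or neither");
            }
        }
        return targetIsRaw;
    }

    public static void main(String[] args) {
        // Both sides under /.reserved/raw: raw.* xattrs would be preserved.
        System.out.println(shouldPreserveRawXattrs(
            Arrays.asList("/.reserved/raw/ez/a"), "/.reserved/raw/backup"));
        // Neither side under /.reserved/raw: raw.* xattrs would not be preserved.
        System.out.println(shouldPreserveRawXattrs(
            Arrays.asList("/ez/a"), "/backup"));
    }
}
```

          Calling shouldPreserveRawXattrs with a mix such as a /.reserved/raw source and a plain target raises IllegalArgumentException, mirroring the "exception is thrown" case in the comment.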

          Charles Lamb added a comment -

          Here's a patch to implement the -pd option for cp, similar to distcp -pd.


            People

            • Assignee:
              Charles Lamb
              Reporter:
              Charles Lamb
            • Votes:
              0
              Watchers:
              4

                Development