Affects Version/s: None
Fix Version/s: None
Steve pointed out to me that the s3 libraries buffer data to disk. This is pretty much arbitrary user data.
Spark has some settings to encrypt data that it writes to local disk (shuffle files etc.). Spark never has control of what arbitrary libraries are doing with data, so it doesn't guarantee that nothing ever ends up on disk, but end users would view those S3 libraries as part of the same system. So a user who turns on Spark's local-disk encryption would be pretty surprised to find out that the data they're writing to S3 ends up on local disk, unencrypted.
... Regardless, this is still an s3a bug.
We need to save intermediate data "somewhere"; people get a choice of disk or memory.
Encrypting data on disk was never considered necessary, on the basis that anyone malicious with read access under your home dir could lift the Hadoop token file which YARN provides, and so have full R/W access to all your data in the cluster filesystems until those tokens expire. If you don't have a good story there, then the buffering of a few tens of MB of data during upload is a detail.
There's also the extra complication that when uploading file blocks, we pass the filename to the AWS SDK and let it do the uploads, rather than create the output stream; the SDK code has, in the past, been better at recovering from failures there than output stream + mark and reset. That was a while back; things may change. But it is why I'd prefer any encrypted temp store to be a new buffer option, rather than silently changing the "disk" buffer option to encrypt.
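To make the "new buffer option" idea concrete, here is a minimal sketch of what an encrypted temp store could look like. It is not S3A code: the class name `EncryptedDiskBuffer`, its methods, and the `s3ablock-` file prefix are all hypothetical, and the choice of AES in CTR mode with a per-buffer in-memory key is an assumption made for illustration. CTR gives confidentiality only, not integrity.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;

/** Hypothetical encrypted temp-file buffer; an illustration, not part of S3A. */
public final class EncryptedDiskBuffer implements AutoCloseable {
    // Assumption: AES/CTR, so ciphertext length equals plaintext length.
    private static final String TRANSFORM = "AES/CTR/NoPadding";

    private final Path file;
    private final SecretKey key;            // kept in memory only, never written to disk
    private final byte[] iv = new byte[16]; // random per-buffer IV

    public EncryptedDiskBuffer(Path dir) throws IOException, GeneralSecurityException {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(128);
        this.key = gen.generateKey();
        new SecureRandom().nextBytes(iv);
        this.file = Files.createTempFile(dir, "s3ablock-", ".enc");
    }

    private Cipher cipher(int mode) throws GeneralSecurityException {
        Cipher c = Cipher.getInstance(TRANSFORM);
        c.init(mode, key, new IvParameterSpec(iv));
        return c;
    }

    /** Write side. Open once per buffer: reusing a key/IV pair in CTR mode is unsafe. */
    public OutputStream openForWrite() throws IOException, GeneralSecurityException {
        return new CipherOutputStream(Files.newOutputStream(file), cipher(Cipher.ENCRYPT_MODE));
    }

    /** Read side: decrypts on the fly, so the uploader would consume a stream, not a filename. */
    public InputStream openForRead() throws IOException, GeneralSecurityException {
        return new CipherInputStream(Files.newInputStream(file), cipher(Cipher.DECRYPT_MODE));
    }

    public Path path() {
        return file;
    }

    /** Deletes the temp file; the key becomes unreachable along with this object. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(file);
    }
}
```

Note the consequence for the point above: because the file on disk is ciphertext, it could not simply be handed to the SDK by filename. The upload would have to go through the decrypting stream, which is exactly the path where failure recovery has been weaker in the past.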
It would be interesting to see where else in the code this needs to be addressed; I'd recommend looking at all uses of org.apache.hadoop.fs.LocalDirAllocator and making sure that Spark YARN launch+execute doesn't use it indirectly.
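As a starting point for that audit, a rough sketch of a source-tree scan is shown here. The class name `LocalDirAllocatorUsages` is hypothetical, and the search is purely textual: it only finds files that mention `LocalDirAllocator` directly. Indirect use through other classes would still need the callers of each hit to be followed by hand or with an IDE.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: lists source files that mention LocalDirAllocator. */
public final class LocalDirAllocatorUsages {
    private static final String NEEDLE = "LocalDirAllocator";

    /** Walks the tree under root and returns matching .java/.scala files, sorted. */
    public static List<Path> find(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(LocalDirAllocatorUsages::isSourceFile)
                .filter(LocalDirAllocatorUsages::mentionsAllocator)
                .sorted()
                .collect(Collectors.toList());
        }
    }

    private static boolean isSourceFile(Path p) {
        String name = p.getFileName().toString();
        return name.endsWith(".java") || name.endsWith(".scala");
    }

    private static boolean mentionsAllocator(Path p) {
        try {
            // Plain substring match; comments and imports count as hits too.
            String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            return text.contains(NEEDLE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        find(root).forEach(System.out::println);
    }
}
```

Run against checkouts of both Hadoop and Spark, the resulting file list gives the direct call sites to review first.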
JIRAs under HADOOP-15620 welcome; do look at the test policy in the hadoop-aws docs. We'd need a new subclass of AbstractSTestS3AHugeFiles for integration testing a different buffering option, plus whatever unit tests the encryption itself needs.
I get it. But ... there are a couple of subtleties here. One is that the tokens expire, while the data is still data. (This might or might not matter, depending on the threat...) Another is that customer policies in this area do not always align well with common sense. We have come up against blanket policies like "data shall never be written to disk unencrypted", and we'd like to be able to honestly say that we comply. We have encrypted MR shuffle as one historical example, and encrypted Impala memory spills as another.