Hi Chris Douglas
Thank you for the feedback. Updating the patch to address most of your nits :
There are many counterexamples, but running a MR job is a heavy way to test this
Agreed. But a MR Job will ensure all code paths are handled. I will be adding testcases to existing classs (for eg. TestMerger) to validate that the merging works fine with Shuffle turned on. But I don't see too many tests cases to validate that mapOutput spillfiles are correctly being partitioned and sent to the correct reduces.
Has this been tested on spills with intermediate merges? With more than a single reduce? Looking at the patch, it looks like it creates the stream with the IV, it doesn't reset the IV for each segment (apologies, I haven't tried applying it, so I might just be misreading the context).
Modifying the TestMerger class to use the CryptoShuffle will ensure the former. The current Test case Included with the patch tests with multiple reducers.. I will refactor it a bit to explicitly test these scenarios
To make it backwards compatible, the IV can be part of each IFile segment (requiring no changes to ShuffleHandler or the SpillRecord/IndexRecord format), or the IVs can be added to the end of the SpillRecord. In the latter case, the Fetcher will need to request that the alternate interpretation by including a header; old versions will get the existing interpretation of the SpillRecord.
As per your suggestion, I was actually able to get the end to end flow working without having to touch ShuffleHandler, ShuffleHeader or IndexRecord. Although, what I did was add the IV to the prefix of an IFile before it is written.. and during Segment::init() when it is read from disk. Only nit is I have to do some amount of book-keeping on the MapTask and Fetcher to add/remove the 16 bytes.
Since the IV size is hard-coded in CryptoUtils to 16 bytes (and part of the IndexRecord format), it should probably fail if the CryptoCodec::getAlgorithmBlockSize returns anything else.
Yup.. this would have been an issue had I had to modify the IndexRecord/ShuffleHeader. But now we don't, so this is not an issue anymore