Asokan, sorry for the delay; I've been away traveling home for the holidays.
I have more comments, but I'll put some here to keep the discussion going.
Thanks for the design doc, but I was looking for thoughts on how the plugin was going to be used for the use-cases you've mentioned (hash-join etc.), design alternatives, etc.
IAC, taking a step back, the 'goal' here is to make the 'merge' pluggable.
Reduce-side has 2 pieces:
- Shuffle - Move data from maps to the reduce.
- Merge - Merge already sorted map-outputs.
The rest (MergeManager etc.) are merely implementation details to manage memory etc., which become irrelevant in several scenarios as soon as we consider alternatives to the current HTTP-based shuffle (several alternatives exist, such as RDMA).
Your current approach tries to encapsulate and enshrine the current implementation of the reduce task, which I'm not wild about. By this I mean, you are focussing too much on the current state and trying to make interfaces which are unnecessary for now and might not suffice for the future.
I really don't think we should be tying Shuffle & Merge as you have done by introducing yet another new interface (regardless of whether it's public or not).
As I've noted above, adding a simple 'Merge' interface with one 'merge' call will address all of the use-cases you have outlined. If not, let's discuss.
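To make the discussion concrete, here is a rough sketch of what a single-call 'Merge' interface could look like. Everything in it is an assumption for illustration only: the names `Merge` and `HeapMerge`, the generic signature, and the use of plain `java.util` iterators are hypothetical and are not an existing Hadoop API. The sample implementation is an ordinary heap-based k-way merge over already sorted inputs, and it knows nothing about how the data was shuffled.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical plugin point: one 'merge' call over already sorted inputs.
// Not an existing Hadoop interface; the name and signature are assumptions.
interface Merge<T> {
    Iterator<T> merge(List<Iterator<T>> sortedInputs, Comparator<T> comparator);
}

// Illustrative implementation: heap-based k-way merge.
final class HeapMerge<T> implements Merge<T> {

    // Pairs the current head element of an input with the input it came from.
    private static final class Head<E> {
        final E value;
        final Iterator<E> source;

        Head(E value, Iterator<E> source) {
            this.value = value;
            this.source = source;
        }
    }

    @Override
    public Iterator<T> merge(List<Iterator<T>> sortedInputs, Comparator<T> comparator) {
        // Order heap entries by their head element.
        Comparator<Head<T>> byValue = (a, b) -> comparator.compare(a.value, b.value);
        PriorityQueue<Head<T>> heap = new PriorityQueue<>(byValue);

        // Seed the heap with the first element of every non-empty input.
        for (Iterator<T> input : sortedInputs) {
            if (input.hasNext()) {
                heap.add(new Head<>(input.next(), input));
            }
        }

        return new Iterator<T>() {
            @Override
            public boolean hasNext() {
                return !heap.isEmpty();
            }

            @Override
            public T next() {
                // remove() throws NoSuchElementException once all inputs are drained.
                Head<T> head = heap.remove();
                if (head.source.hasNext()) {
                    heap.add(new Head<>(head.source.next(), head.source));
                }
                return head.value;
            }
        };
    }
}
```

With something of this shape, a hash-join or any other alternative would just be another implementation of the same call, and the shuffle side (HTTP, RDMA, or anything else) would only need to hand over the fetched map-outputs.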