1. If the previous stage using 1 reduce, no need to add one more vertex
I think in particular this should be that if the cardinality of the previous Tez vertex (i.e. the number of run-time tasks generated by the previous vertex) equals 1, then there is no more need to add another vertex. Does this still sound like an optimization that we should still implement, and if so, how would we check such a thing?
2. If the limitplan is null (ie, not the "limited order by" case), we might not need a shuffle edge, a pass through edge should be enough if possible
I'm reading this to mean that a sort isn't necessary on the edge. Daniel - the way you wrote this, it sounds like you want to set the edge to be a broadcast edge rather than a scatter-gather or one-to-one edge. However, since the parallelism of the the final vertex that implements LIMIT is one, I don't think any of these really make a difference (correct me if I am wrong). Thus, I'm reading this to mean that we don't need to do any sorting on the edge (LIMIT explicitly says that it might return any order).
For my patch, I am changing the input / output types to be OnFileUnorderedKVOutput / ShuffledUnorderedKVInput and leaving the edge type to SCATTER_GATHER. However, I have these lines commented out with a note that they should be uncommented after
TEZ-661. Does this sound like the right thing to do?
3. Similar to
PIG-1270, we can push limit to InputHandler
This is done by the LimitOptimizer already, I have gone through and verified it.
4. We also need to think through the "limited order by" case once "order by" is implemented
There is quite a bit of code to handle LIMIT in TezCompiler::getSortJobs. Is this code already sufficient?