I agree with Patrick Wendell that it does not help Spark to introduce a dependency on Tez in core.
Oleg Zhurakousky, is Tez available on all YARN clusters? Or is it an additional runtime dependency?
If it is available by default, we can add a runtime switch to use Tez for jobs running in yarn-standalone and yarn-client mode.
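For illustration, such a switch could be as simple as reading a boolean flag from the job configuration and only engaging Tez when the application is actually running on YARN. A minimal sketch, where the `spark.yarn.useTez` key and the `BackendSwitch` class are purely hypothetical names, not actual Spark configuration or API:

```java
import java.util.HashMap;
import java.util.Map;

public class BackendSwitch {
    // Hypothetical sketch: pick an execution backend from config.
    // "spark.yarn.useTez" is an illustrative key, not a real setting.
    static String selectBackend(Map<String, String> conf) {
        String master = conf.getOrDefault("spark.master", "local");
        boolean useTez = Boolean.parseBoolean(
            conf.getOrDefault("spark.yarn.useTez", "false"));
        // Only engage Tez when running on YARN and the flag is set;
        // all other modes fall through to the default engine.
        if (useTez && master.startsWith("yarn")) {
            return "tez";
        }
        return "default";
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.master", "yarn-client");
        conf.put("spark.yarn.useTez", "true");
        System.out.println(selectBackend(conf)); // prints "tez"
    }
}
```

The point of gating on both the flag and the master URL is that the switch stays inert for local, standalone, and Mesos deployments, so non-YARN users never pay for the extra dependency.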
But before that ...
While better multi-tenancy would be a likely benefit, my specific interest in this patch is more to do with the much better shuffle performance that Tez offers, specifically for ETL jobs. I can see other benefits which might be relevant: one of our collaborative filtering implementations, though not ETL, comes fairly close to it in job characteristics and suffers due to some of our shuffle issues ...
As I alluded to, I do not think we should have an open-ended extension point where any class name can be provided to extend functionality in an arbitrary manner, for example like the SPI we have for compression codecs.
As Patrick mentioned, this gives the impression that the approach is blessed by the Spark developers, even if it is tagged as Experimental.
Particularly with core internals, I would be very wary of exposing them via an SPI, simply because we need the freedom to evolve them for performance or functionality reasons.
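For context, a class-name SPI of the kind mentioned above typically works by instantiating a user-supplied fully qualified class name via reflection. A minimal sketch of that pattern, with illustrative interface and class names (not Spark's actual API), which also shows why such an extension point effectively freezes whatever internal types it exposes:

```java
public class SpiExample {
    // The extension point: implementations can come from any jar on
    // the classpath, identified only by a fully qualified class name.
    public interface ShuffleBackend {
        String name();
    }

    public static class DefaultBackend implements ShuffleBackend {
        public String name() { return "default"; }
    }

    static ShuffleBackend load(String className) throws Exception {
        // Reflection means callers can plug in arbitrary classes,
        // which also means any internal type the interface exposes
        // becomes a de facto public API that is hard to evolve.
        return (ShuffleBackend) Class.forName(className)
            .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        ShuffleBackend b = load("SpiExample$DefaultBackend");
        System.out.println(b.name()); // prints "default"
    }
}
```

Once third parties implement `ShuffleBackend` against internal types, changing those types breaks them, which is exactly the evolution freedom we would lose by exposing core internals this way.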
On the other hand, I am in favour of exploring this option to see what sort of benefits we get out of it, assuming it has been prototyped already, which I thought was the case here, though I am yet to see a PR with that (not sure if I missed it!).
Given that Tez is supposed to be reasonably mature, if there is a Spark + Tez version, I want to see what benefits (if any) are observed as a result of this effort.
I had discussed Spark + Tez integration about a year or so back with Matei, but at that time Tez was probably not that mature; maybe this is a better time!
Oleg Zhurakousky, do you have a Spark-on-Tez prototype done already? Or is this an experiment you are yet to complete? If complete, what sort of performance difference do you see? What metrics are you using?
If there are significant benefits, I would want to take a closer look at the final proposed patch ... I would be interested in it making its way into Spark in some form.
As Nicholas Chammas mentioned, if it is possible to address this in Spark directly, nothing like it, particularly since that would benefit all modes of execution and not just the YARN + Tez combination.
If the gap cannot be narrowed, and the benefits are significant (for some, as of now undefined, definition of "benefits" and "significant"), then we can consider a Tez dependency in the yarn module.
Of course, all these questions are moot until we have a better quantitative judgement of what the expected gains are and what the experimental results are.