Thanks for the review, Hitesh!
bq. Why does classpath need to include all of common, hdfs and yarn jar locations? Assuming that MR is running on a YARN-based cluster, shouldn't the location of the core dependencies come from the cluster deployment, i.e. via the env that the NM sets for a container? I believe the only jars that MR should have in its uploaded tarball should be the client jars. I understand that there is no clear boundary for client-side-only jars for common and hdfs today (for YARN, I believe it should be simple to split out the client-side requirements), but it is something we should aim for, or else assume that the jars deployed on the cluster are compatible.
This is primarily to avoid jar conflicts and to remove dependencies on the nodes. If the cluster upgrades and picks up a new version of jackson/jersey/guava/name-your-favorite-jar-that-breaks-apps-when-updated, then existing apps can suddenly break due to jar conflicts. Another case we've seen is a dependency jar being dropped between versions when apps were relying on Hadoop to provide it. Having the apps provide all of their dependencies means we can focus on just the RPC-layer compatibility (something we have to solve anyway) rather than also worrying about the myriad of combinations between jars within the app and those picked up from the nodes.
However, if desired, the user could configure it to work with just a partial tarball by setting the classpath to pick up the jars on the nodes via HADOOP_COMMON_HOME/HADOOP_HDFS_HOME/HADOOP_YARN_HOME references, like MRApps does today.
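To make that concrete, here is a rough sketch of what a partial-tarball setup could look like, using the mapreduce.application.classpath property that MRApps consults today. The exact entries (the "mrframework" alias for the localized archive and the share/... paths) are only illustrative and depend on how the tarball and the node install are laid out:

{code:xml}
<!-- Sketch only: a partial tarball that carries just the MR jars.        -->
<!-- The first entries point at the localized archive (aliased here as    -->
<!-- "mrframework"); the rest fall back to the node-local install via the -->
<!-- env vars the NM sets, much like the MRApps default classpath today.  -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>$PWD/mrframework/*,$PWD/mrframework/lib/*,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*</value>
</property>
{code}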
bq. I would vote to make the tar-ball in HDFS be the only way to run MR on YARN. Obviously, this cannot be done for 2.x but we should move to this model on trunk and not support the current approach at all there. Comments?
I'm all for it, and I see this as being a stepping stone to getting there. We'd like to have the ability to run out of HDFS in 2.x as a potential way to do a rolling upgrade of bugfixes in the MR framework. It probably won't be a complete solution to all forms of upgrades (i.e.: what if the client code or ShuffleHandler needs the fix), but it could still be very useful in practice.
bq. The other point is related to configs.
Yes, final parameter configs on the nodes conflicting with the job.xml settings are another concern. In practice I don't expect that to be a common issue, but it is something we should try to address in a followup JIRA.
bq. How do you see the framework name extracted from the path being used? Is it just a safety check to ensure that it is found in the classpath? Will it have any relation to a version?
I see the framework fragment "alias" primarily being used for sanity checks (in case the classpath wasn't updated to match the specified framework) and to allow the classpath settings to be a bit more general. For example, ops could configure the classpath once based on an expected framework tarball layout (e.g.: mrframework/share/mapreduce/*, mrframework/share/mapreduce/lib/*, etc.), and different versions of the tarball could then be used without modifying the classpath as long as they match that layout, e.g.: mrtarball-2.3.1.tgz#mrframework, mrtarball-2.3.4.tgz#mrframework, etc. It's sort of like the assumed-layout approach from your last comment. Ops could set the classpath, and users could select the framework version without having to set the classpath as long as the layout is compatible. Users could still override the classpath if using a framework that isn't compatible with the assumed layout.
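Something along these lines is what I'm picturing (the property names and the HDFS path below are just illustrative):

{code:xml}
<!-- Illustrative sketch: ops set the classpath once against the expected -->
<!-- "mrframework" alias/layout, and switching framework versions only    -->
<!-- means changing the framework path, e.g. to mrtarball-2.3.4.tgz.      -->
<property>
  <name>mapreduce.application.framework.path</name>
  <value>hdfs:///apps/mr/mrtarball-2.3.1.tgz#mrframework</value>
</property>
<property>
  <name>mapreduce.application.classpath</name>
  <value>$PWD/mrframework/share/mapreduce/*,$PWD/mrframework/share/mapreduce/lib/*</value>
</property>
{code}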
One problem with the common classpath approach is that the archives need to have the same directory structure, so top-level directories with the version number in them break it. The tarballs deployed to HDFS would have to be reorganized to have a common dir name rather than the versioned name. Not difficult to do, but it is annoying.
bq. A minor nit: "framework name" seems confusing in relation to the framework name already in use, i.e. yarn vs. local framework.
Yeah, that's true. I'm open to suggestions for what to call this instead of framework.
bq. Regarding versions, it seems like users will need to do 2 things. Change the location of the tarball on HDFS and modify the classpath. Users will need to know the exact structure of the classpath. In such a scenario, do defaults even make sense?
I wanted this to be flexible so ops/users could decide how to organize the framework (i.e.: partial/complete tarball, monolithic jar, whatever) and be able to set the classpath accordingly. I thought about hardcoding the assumption of the layout, but then that imposes a burden on future frameworks to match that (potentially obsolete) layout and is less flexible. The cost of additional flexibility is additional configuration burden, so it's a tradeoff. That's why I was trying to use the framework fragment alias so the classpath could be set once and hopefully reused for quite some time.
I suppose one compromise between the two approaches would be a standard file located alongside the archive that the client can read to determine the corresponding classpath settings. There are probably lots of approaches to automatically determining the classpath for an archive, and I'm interested to hear thoughts on it.
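For instance (purely hypothetical, just to illustrate the compromise), the client could look for a small descriptor sitting next to the archive in HDFS and splice its classpath into the job config; the file name and key below are made up:

{code:xml}
<!-- Hypothetical descriptor, e.g. mrtarball-2.3.4.tgz.classpath.xml,     -->
<!-- located alongside the archive. The client would read it and use its  -->
<!-- "classpath" value instead of requiring users to configure            -->
<!-- mapreduce.application.classpath by hand for that framework layout.   -->
<configuration>
  <property>
    <name>classpath</name>
    <value>$PWD/mrframework/share/mapreduce/*,$PWD/mrframework/share/mapreduce/lib/*</value>
  </property>
</configuration>
{code}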