Affects Version/s: 0.6.0
Fix Version/s: None
Something that's been bothering me recently is how we can support multiple YARN 2.X clusters, all on different versions.
At LinkedIn, we run integration and production YARN clusters for Samza on a specific version of YARN, but have a test cluster running a different (usually newer) version. For example, we might run integration and production clusters on 2.0.5-alpha, but the test cluster on 2.1.0.
The problem we have right now is that it's somewhat difficult to build a binary that can run on all of these clusters. Samza packages are deployed through a .tgz file with a bin and lib directory. The lib directory contains all jars (including Hadoop) that the job needs to run. The bin directory has a run-class.sh script in it, which adds ../lib/*.jar to the classpath.
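To make the package structure concrete, the .tgz layout described above looks roughly like this (jar names are illustrative, not the actual file list):

```
my-job.tgz
├── bin/
│   └── run-class.sh      # adds ../lib/*.jar to the classpath
└── lib/
    ├── samza-*.jar       # job and framework jars
    └── hadoop-*.jar      # Hadoop/YARN jars, bundled into every build today
```

Because the Hadoop/YARN jars live inside lib/ alongside everything else, swapping YARN versions means rebuilding the whole tarball.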
To complicate matters, YARN's 2.X versions are all backwards incompatible at the protocol level. A client built against YARN 2.0.4-alpha can't run against a 2.0.5-alpha cluster, and vice-versa. At the API level (for the APIs Samza uses), 2.0.3-alpha, 2.0.4-alpha, and 2.0.5-alpha are all compatible (i.e. you can compile and run Samza with any one of these versions). I haven't tested 2.1.0-beta yet, but I assume it's API compatible with 2.0.3-alpha and onward (this should be verified, though). As a result, the incompatibility appears to be purely a runtime issue: YARN clusters simply reject RPC calls from protobuf RPC versions that don't match the version the cluster is running (2.0.5-alpha clusters reject all non-2.0.5-alpha RPC calls).
Given this set of facts, right now you have to cross-build a Samza tarball for each version of Hadoop that you're running on (my-job-yarn-2.0.3-alpha.tgz, my-job-yarn-2.0.4-alpha.tgz, etc.). This is annoying, and it somewhat breaks reproducibility. Generally, testing one binary and running a different one in production is a bad idea, even if the two are theoretically identical except for the YARN jars.
I was thinking a better solution might be to:
1. Change the YarnConfig/YarnJob/ClientHelper to allow arbitrary resources with a config scheme like this: yarn.resource.<resource-name>.path=...
2. Change run-class.sh to add both ../lib/*.jar and ../.../hadoop-libs/*.jar to the classpath
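A minimal sketch of step 2, assuming the hadoop-libs directory sits next to lib/ in the extracted package (the directory names and relative paths here are assumptions, not the final layout):

```shell
# Sketch of the proposed run-class.sh change: build the classpath from
# both lib/ (job jars) and hadoop-libs/ (cluster-provided YARN jars).
set -e
pkg=$(mktemp -d)   # stand-in for the extracted job package root
mkdir -p "$pkg/lib" "$pkg/hadoop-libs"
touch "$pkg/lib/samza-core.jar" "$pkg/hadoop-libs/hadoop-yarn-client.jar"

CLASSPATH=""
for jar in "$pkg"/lib/*.jar "$pkg"/hadoop-libs/*.jar; do
  CLASSPATH="$CLASSPATH:$jar"
done
CLASSPATH=${CLASSPATH#:}   # strip the leading colon
echo "$CLASSPATH"
# the real script would then do something like: exec java -cp "$CLASSPATH" ...
```

Since hadoop-libs/ is attached as a separate resource at submit time, the same job tarball runs unchanged against any YARN version.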
This seems like a good change to me because:
1. It allows jobs to attach arbitrary resources in addition to hadoop-libs.
2. It allows a single binary to be run with multiple versions of hadoop (vs one binary per Hadoop/YARN version).
3. If a non-manual (automated or CI-based) job deployment mechanism runs run-job.sh, it can auto-inject the appropriate version of YARN (using --config yarn.resource.hadoop-libs.path=...) based on the YARN cluster it's deploying the job to.
4. If run-job.sh is being run by a user, they can either use the --config approach (described in number 3) or a per-environment properties file (test.properties points at YARN 2.1.0-beta, prod.properties at YARN 2.0.5-alpha).
5. It means that rolling out a new version of YARN doesn't require a re-build of every job package that runs on the cluster. You just need to update the config path to point at the new Hadoop libraries.
6. This would theoretically allow us to switch from a JSON/environment-variable config-passing approach to a distributed-cache config-passing approach (config is a file on HDFS, HTTP, or wherever). Right now, config is resolved at job-execution time (run-job.sh/JobRunner) into key/value pairs, and is passed to the AM/containers as a JSON blob in an environment variable. This is kind of hacky, has size limits, and is probably not the safest thing to do. A better (but perhaps less convenient) approach might be to require config to be a file that gets attached and un-tar'ed along with hadoop-libs and the job package. I haven't fully thought the file-based config feature through, but the proposed change would allow us to support it.
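Points 3 and 4 above could look something like this in practice (the paths, cluster names, and tarball names below are hypothetical, for illustration only):

```
# Point 3: a deployment tool overrides the resource at submit time
#   run-job.sh --config yarn.resource.hadoop-libs.path=hdfs://nn/frameworks/hadoop-2.0.5-alpha.tgz

# Point 4: per-environment properties files

# test.properties
yarn.resource.hadoop-libs.path=hdfs://test-nn/frameworks/hadoop-2.1.0-beta.tgz

# prod.properties
yarn.resource.hadoop-libs.path=hdfs://prod-nn/frameworks/hadoop-2.0.5-alpha.tgz
```

Either way, the job tarball itself stays identical across environments; only the yarn.resource.hadoop-libs.path value changes.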