Jian He mentioned this offline and the configuration approach concerns me too.
Stepping back, I think the way apps currently discover the Scheduler is completely broken. Distributed Shell, for example, works only because it is a Java application and the NM happens to put HADOOP_CONF_DIR on the classpath. Irrespective of this JIRA, we need to fix scheduler discovery for the apps. The current approach of depending on server configuration is unreliable in the face of rolling upgrades.
The specific solution in this JIRA further breaks rolling upgrades and configuration updates. If and when an admin forces client configuration changes, the config written by the node will go out of sync. Overall, this makes the situation worse.
I'd suggest that we start moving towards a better scheduler-discovery model. We have already done similar work with the Timeline service (YARN-3039). We can implement part of that here, an environment-based discovery: the NodeManager sets an environment variable, say YARN_SCHEDULER_ADDRESS for now, in the AM env, and it is respected as the first-level discovery mechanism. As we add more first-class discovery mechanisms, this env can take lower precedence. This approach isn't far from your current solution either: instead of pointing to a conf-dir env, you point to a scheduler-address env directly.
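
To make the proposed lookup order concrete, here is a rough sketch of what the AM side could look like. It is only an illustration under assumptions: the class and method names (SchedulerDiscovery, resolve, parse) are hypothetical, YARN_SCHEDULER_ADDRESS is the env name suggested above and is not set by anything today, and the env and configuration are passed in as plain maps so the snippet stays independent of Hadoop classes. The fallback key is the existing yarn.resourcemanager.scheduler.address property.

```java
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: env-first scheduler discovery with a config fallback.
final class SchedulerDiscovery {
    // Env name proposed in this comment (assumption: the NM would set it in the AM env).
    static final String SCHEDULER_ADDRESS_ENV = "YARN_SCHEDULER_ADDRESS";
    // Existing YARN property, used only when the env is absent.
    static final String SCHEDULER_ADDRESS_KEY = "yarn.resourcemanager.scheduler.address";

    private SchedulerDiscovery() {}

    // Returns the scheduler address, preferring the env over the configuration.
    static Optional<InetSocketAddress> resolve(Map<String, String> env,
                                               Map<String, String> conf) {
        String fromEnv = env.get(SCHEDULER_ADDRESS_ENV);
        if (fromEnv != null && !fromEnv.trim().isEmpty()) {
            return Optional.of(parse(fromEnv.trim()));
        }
        String fromConf = conf.get(SCHEDULER_ADDRESS_KEY);
        if (fromConf != null && !fromConf.trim().isEmpty()) {
            return Optional.of(parse(fromConf.trim()));
        }
        return Optional.empty();
    }

    // Parses "host:port" without triggering a DNS lookup.
    static InetSocketAddress parse(String hostPort) {
        int idx = hostPort.lastIndexOf(':');
        if (idx <= 0 || idx == hostPort.length() - 1) {
            throw new IllegalArgumentException("Expected host:port, got: " + hostPort);
        }
        String host = hostPort.substring(0, idx);
        int port = Integer.parseInt(hostPort.substring(idx + 1));
        return InetSocketAddress.createUnresolved(host, port);
    }
}
```

An AM would call something like SchedulerDiscovery.resolve(System.getenv(), confAsMap). When more first-class discovery mechanisms are added later, they could be consulted before the env check so that the env takes lower precedence.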