I'm glad you filed this. I ran into the same frustration myself in the last couple of weeks and have various thoughts on it. Some of these ideas are raw and flawed, but here is what I have been thinking:
Ideally, the framework would limit the classes visible to a job to the minimum required for job execution. A job could then bring in its own dependencies. And if there were a built-in Hadoop dependency, hidden by default, that a job wanted, the job could request access to it.
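To make that concrete, here is a minimal sketch of the idea, assuming a hypothetical filtering classloader (this is not an existing Hadoop API; the class and method names are invented for illustration). It only delegates whitelisted package prefixes to its parent, and a job can opt in to a hidden dependency by requesting access:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a per-job classloader that hides everything except
// an allowed set of package prefixes. Not a real Hadoop API.
public class JobClassLoader extends ClassLoader {
    private final Set<String> allowedPrefixes;

    public JobClassLoader(ClassLoader parent, String... allowedPrefixes) {
        super(parent);
        this.allowedPrefixes = new HashSet<>(Arrays.asList(allowedPrefixes));
    }

    // A job could call this to opt in to a built-in dependency that is
    // hidden by default.
    public void requestAccess(String packagePrefix) {
        allowedPrefixes.add(packagePrefix);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        for (String prefix : allowedPrefixes) {
            if (name.startsWith(prefix)) {
                // Allowed: delegate to the parent chain as usual.
                return super.loadClass(name, resolve);
            }
        }
        throw new ClassNotFoundException(name + " is hidden from this job");
    }

    public static void main(String[] args) throws Exception {
        JobClassLoader cl = new JobClassLoader(
                JobClassLoader.class.getClassLoader(), "java.lang.");
        System.out.println(cl.loadClass("java.lang.String").getName());
        try {
            cl.loadClass("java.util.ArrayList");
        } catch (ClassNotFoundException e) {
            System.out.println("hidden: java.util.ArrayList");
        }
        // After the job requests access, the dependency becomes visible.
        cl.requestAccess("java.util.");
        System.out.println(cl.loadClass("java.util.ArrayList").getName());
    }
}
```

A real implementation would have to be more careful (core `java.*` classes must always come from the bootstrap loader, for instance), but the delegation-filtering shape is the point.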
Similarly frustrating, and related, is how an M/R job has to submit its whole job jar to the cluster each time. I have a 28 MB jar and a workflow of about 35 dependent M/R jobs (a DAG of them). Toward the end of this chain, the jobs get smaller and smaller in data size (the final ones join, augment, transform, and sort data aggregated by the earlier jobs).
Two things account for more wall-clock time than the 'heavy lifting' of the initial 'big data' jobs: job submission time and scheduling inefficiencies. The former is largely a dependency-management problem.
If the framework could support installing jars into an 'application' classloader space that jobs then reference, task latency could be reduced significantly, since each job submission would no longer need to ship all of its dependency jars. In my case, the job jar would probably shrink from almost 30 MB to a couple hundred KB, or even to nothing if the jobs themselves could be stored on the cluster and simply invoked. TaskTracker nodes could cache these application library spaces to reduce job start-up time.
In some ways, the dependency management above resembles an application server. Each 'application' has its own classloader space, and an 'application' might offer several different jobs, analogous to several servlets in a web app. Like an app server, there will probably be a need for three lib scopes: a global one, one exclusive to the framework, and a per-application space.
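A TaskTracker-side cache of these application library spaces might look roughly like the following sketch. Everything here is hypothetical (the class name, `loaderFor`, and the app-ID keying are invented), but it shows the layering: a per-application `URLClassLoader` over the installed jars, parented on the framework loader, resolved once and reused across job submissions:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an app-server-style library cache on a
// TaskTracker. Not an existing Hadoop API.
public class AppLibraryCache {
    private final ClassLoader frameworkLoader;
    private final Map<String, ClassLoader> appLoaders =
            new ConcurrentHashMap<>();

    public AppLibraryCache(ClassLoader frameworkLoader) {
        this.frameworkLoader = frameworkLoader;
    }

    // jarUrls point at the jars installed for this application. Because the
    // loader is cached per app ID, the jars are resolved once rather than
    // shipped and opened on every job submission.
    public ClassLoader loaderFor(String appId, URL... jarUrls) {
        return appLoaders.computeIfAbsent(
                appId, id -> new URLClassLoader(jarUrls, frameworkLoader));
    }
}
```

The three lib scopes fall out of the parent chain: global libs at the root, the framework loader in the middle, and the per-application loader on top.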
Such classloader partitioning raises some questions about static variables. With JVMs shared across tasks, users expect statics to survive from one task to the next within the same job. This means the classloader in a JVM must correspond to the job ID plus whether the task is a map or a reduce. Per-job classloaders could also enable JVM recycling across jobs in the distant future, because disposing of a job's classloader frees its static variables. That in turn opens the possibility of further reductions in start-up time and per-task cost.
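The keying and disposal could be sketched like this (again hypothetical names, not a real Hadoop API): one loader per (job ID, map-or-reduce) pair, so statics persist across tasks of the same job, and a `dispose` call when the job finishes, which makes the job's classes and their statics collectable so the JVM can be reused:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-(job, task-type) classloader management in a
// shared task JVM. Not an existing Hadoop API.
public class JobLoaderRegistry {
    public enum TaskType { MAP, REDUCE }

    private final Map<String, ClassLoader> loaders = new HashMap<>();

    private static String key(String jobId, TaskType type) {
        return jobId + "/" + type;
    }

    // Every task of the same job and task type sees the same loader, so its
    // static variables live from one task to the next, as users expect.
    public synchronized ClassLoader loaderFor(String jobId, TaskType type) {
        return loaders.computeIfAbsent(
                key(jobId, type),
                k -> new URLClassLoader(new URL[0],
                        JobLoaderRegistry.class.getClassLoader()));
    }

    // Dropping the loader reference makes its classes (and their statics)
    // eligible for garbage collection, so the JVM can be recycled for the
    // next job instead of being torn down.
    public synchronized void dispose(String jobId, TaskType type) {
        loaders.remove(key(jobId, type));
    }
}
```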