Hadoop Map/Reduce
MAPREDUCE-1901

Jobs should not submit the same jar files over and over again


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate

    Description

      Currently, each Hadoop job uploads its required resources (jars/files/archives) to a new location in HDFS. The map-reduce nodes involved in executing the job then download these resources to local disk.

      In an environment where most users share a standard set of jars and files (because they use a framework like Hive or Pig), the same jars keep getting uploaded and downloaded repeatedly. The overhead of this protocol (primarily in terms of end-user latency) is significant when:

      • the jobs are small (and, correspondingly, large in number)
      • the Namenode is under load (meaning HDFS latencies are already high and are made worse, in part, by this protocol)

      Hadoop should provide a way for jobs in a cooperative environment to avoid submitting the same files over and over again. Identifying and caching execution resources by a content signature (md5/sha) would be a good option to have available; a rough sketch of the idea appears below.
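
      The idea can be illustrated with a small, hypothetical sketch (this is not the attached patch): before a client uploads a job jar, it computes a content signature of the jar and reuses a copy already present under a shared HDFS cache directory. The cache root path, class name, and helper methods below are illustrative assumptions.

      import java.io.InputStream;
      import java.security.MessageDigest;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class SharedJarCache {

        // Hypothetical shared, world-readable cache root in HDFS.
        private static final Path CACHE_ROOT = new Path("/tmp/job-jar-cache");

        /** Returns the HDFS path of the jar, uploading it only if no copy
         *  with the same content signature is already cached. */
        public static Path getOrUpload(Configuration conf, Path localJar) throws Exception {
          FileSystem localFs = FileSystem.getLocal(conf);
          FileSystem dfs = FileSystem.get(conf);

          // Content signature of the local jar (SHA-256 here; md5/sha1 would also work).
          String signature = sha256(localFs, localJar);
          Path cached = new Path(CACHE_ROOT, signature + "/" + localJar.getName());

          if (dfs.exists(cached)) {
            return cached;                            // reuse: no upload needed
          }
          dfs.mkdirs(cached.getParent());
          dfs.copyFromLocalFile(localJar, cached);    // first submitter pays the upload cost
          return cached;
        }

        // Hex-encoded SHA-256 digest of a file's contents.
        private static String sha256(FileSystem fs, Path file) throws Exception {
          MessageDigest digest = MessageDigest.getInstance("SHA-256");
          try (InputStream in = fs.open(file)) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
              digest.update(buf, 0, n);
            }
          }
          StringBuilder hex = new StringBuilder();
          for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
          }
          return hex.toString();
        }
      }

      In this sketch the first job to reference a given jar pays the upload cost, and later jobs only perform an existence check, so N submissions cost one upload plus N-1 lookups against the Namenode instead of N uploads. Concurrency, permissions, and cache eviction are left out, and the attached patches may take a different approach.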

      Attachments

        1. 1901.PATCH (65 kB) - Junjie Liang
        2. 1901.PATCH (55 kB) - Junjie Liang

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: Joydeep Sen Sarma (jsensarma)
            Votes: 0
            Watchers: 30

            Dates

              Created:
              Updated:
              Resolved:
