Hadoop Map/Reduce
MAPREDUCE-1901

Jobs should not submit the same jar files over and over again

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Currently each Hadoop job uploads the required resources (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in executing this job would then download these resources into local disk.

      In an environment where most of the users are using a standard set of jars and files (because they are using a framework like Hive/Pig) - the same jars keep getting uploaded and downloaded repeatedly. The overhead of this protocol (primarily in terms of end-user latency) is significant when:

      • the jobs are small (and, conversely, large in number)
      • the Namenode is under load (meaning HDFS latencies are high and are made worse, in part, by this protocol)

      Hadoop should provide a way for jobs in a cooperative environment to avoid submitting the same files over and over again. Identifying and caching execution resources by a content signature (md5/sha) would be a good alternative to have available.
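      As a concrete illustration of the content-signature idea (this sketch is not from the attached patch; it is a minimal, pure-JDK example of how a client could derive an MD5-based name for a jar so that byte-identical copies map to the same cached resource):

        import java.io.FileInputStream;
        import java.io.IOException;
        import java.io.InputStream;
        import java.security.MessageDigest;
        import java.security.NoSuchAlgorithmException;

        public class ContentSignature {
          /** Returns the lowercase hex MD5 of the file's contents. */
          public static String md5Hex(String path) throws IOException, NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buf = new byte[64 * 1024];
            InputStream in = new FileInputStream(path);
            try {
              int n;
              while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
              }
            } finally {
              in.close();
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
              hex.append(String.format("%02x", b));
            }
            return hex.toString();
          }

          public static void main(String[] args) throws Exception {
            // Two clients holding byte-identical copies of hive.jar compute the same
            // signature, so a single uploaded copy in HDFS can serve both of them.
            System.out.println(md5Hex(args[0]));
          }
        }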

      1. 1901.PATCH
        55 kB
        Junjie Liang
      2. 1901.PATCH
        65 kB
        Junjie Liang

        Issue Links

          Activity

          Junjie Liang added a comment -

          This patch depends on HADOOP-7022 for a small tweak to the MD5Hash function, where we keep track of the file size when we calculate the hash of a file. It is also a (combined) fix for MAPREDUCE-1902.

          M. C. Srivas added a comment -

          > [ from dhruba ]
          > It means that a central authority stores the mapping of active callbacks and their associated clients (and files). if a client dies prematurely, the central authority should have the option to recover that callback and hand it over to a newly requesting client. are you proposing that the NN and/or JT be this central authority?

          Well, the mtime is a poor-man's version number that is getting checked on every access to see if the file at the server is newer. Adding a callback should reduce this load significantly.

          To the point of the question, yes, the NN should be able to revoke the callback whenever it feels like, at which point the client should get it back before reusing items in its cache. The client, on reboot (of itself or of the NN), must re-establish the callbacks it cares about. Note that the callback is not a lock, but a notification mechanism – many clients can hold callbacks on the same file – so it is not necessary for the NN to revoke a callback from one client in order to hand out a callback for the same file to another client. When a file changes, all outstanding callbacks for it are revoked so clients can discard/refresh their caches.

          But the above is moot. Why does a "bulk-mtime" not work, especially given the manner in which the "bulk-get-md5-signatures" is supposed to work in Joydeep's proposal? They seem to be equally onerous (or not).

          Koji Noguchi added a comment -

          The TaskTracker, on being requested to run a task requiring CAR resource md5_F checks whether md5_F is localized.

          • If md5_F is already localized - then nothing more needs to be done. the localized version is used by the Task
          • If md5_F is not localized - then it's fetched from the CAR repository

          What are we gaining by using md5_F on the TaskTracker side?
          Can we use the existing 'cacheStatus.mtime == confFileStamp' check and change the order of the check so that no unnecessary getFileStatus call is made (MAPREDUCE-2011)?
          Otherwise, this can only be used for dist files loaded by this framework and would require two separate code paths on the TaskTracker side.

          dhruba borthakur added a comment -

          > use a model like AFS with callbacks to implement an on-disk cache

          It means that a central authority stores the mapping of active callbacks and their associated clients (and files). if a client dies prematurely, the central authority should have the option to recover that callback and hand it over to a newly requesting client. are you proposing that the NN and/or JT be this central authority?

          M. C. Srivas added a comment -

          Content-addressable is one way to solve this problem, and it seems like an extremely heavy-weight approach:
          1. more processing to do whenever a file is added to the file-system
          2. reliability issues getting the signature to match the contents across failures/re-replication/etc
          3. a repository of signatures in HDFS is yet another single-point of failure, and yet another database that needs to be maintained (recovery code to handle "no-data-corruption" on a reboot, scaling it as more files added, backup/restore, HA, etc)

          Looks like there are a variety of simpler approaches possible, a few of which come to mind immediately and are listed below in increasing order of complexity.

          1. use distcp or something similar to copy the files onto local disk whenever there is a new version of Hive released, and set pathnames to that. That is, different versions of a set of files are kept in different directories, and pathnames are used to distinguish them. For example, we do not do an md5 check of "/bin/ls" every time we need to run it. We set our pathname appropriately. If there is a different version of "ls" we prefer to use, say, in "/my/local/bin", then we get that by setting /my/local/bin ahead of other paths in our pathname.

          2. instead of implementing a bulk "getSignatures" call to replace several "get_mtime" calls, why not implement a bulk get_mtime instead? (a sketch of such a bulk lookup appears after this list)

          3. use a model like AFS with callbacks to implement an on-disk cache that survives reboots (Dhruba knows AFS very well). In other words, the client acquires a callback from the name-node for each file it has cached, and HDFS guarantees it will notify the client when the file is deleted or changed (at which point, the callback is revoked and the client must re-fetch the file). The callback lasts for, say, 1 week, and can be persisted on disk. On a name-node reboot, the client is responsible for re-establishing the callbacks it already has (akin to a block-report). The client can also choose to return callbacks, in order to keep the memory requirements on the name-node to a minimum. No repository of signatures is needed.
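          A minimal sketch of the bulk-mtime idea from point 2, assuming the shared jars all live under a single HDFS directory (the directory name below is hypothetical): one listStatus() call returns the modification times for every file, replacing one getFileStatus() round trip per jar.

            import java.io.IOException;
            import java.util.HashMap;
            import java.util.Map;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class BulkMtime {
              /** Fetches the mtime of every file in one directory with a single NameNode call. */
              public static Map<Path, Long> mtimes(FileSystem fs, Path dir) throws IOException {
                Map<Path, Long> result = new HashMap<Path, Long>();
                for (FileStatus st : fs.listStatus(dir)) {
                  result.put(st.getPath(), st.getModificationTime());
                }
                return result;
              }

              public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                // "/share/hive/v1" stands in for wherever the shared jars are kept.
                System.out.println(mtimes(fs, new Path("/share/hive/v1")));
              }
            }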

          Joydeep Sen Sarma added a comment -

          the proposal is below - rephrases some of the discussions above, addresses some of the comments around race conditions and points out limitations. Junjie will post a patch tomorrow (which probably needs some more work).

          Background

          Hadoop map-reduce jobs commonly require jars, executables, archives and other resources for task execution on hadoop cluster nodes. A common deployment pattern for Hadoop applications is that the required resources are deployed centrally by administrators (either on a shared file system or deployed on standard local file system paths by package management tools). Users launch Hadoop jobs from these installation points. Applications use apis (-libjars/files/archives) provided by Hadoop to upload resources (from the installation point) so that they are made available for task execution. This behavior makes deployment of Hadoop applications very easy (just use standard package management tools).

          As an example, Facebook has a few different Hive installations (of different versions) deployed on an NFS filer. Each has a multitude of jar files - with only some differing across different Hive versions. Users also maintain a repository of map-reduce scripts and jar files containing Hive extensions (user-defined functions) on an NFS filer. Any installation of Hive can be used to execute jobs against any of multiple map-reduce clusters. Most of the jar files are also required locally (by the Hive client) - either for query compilation or for local execution (either in hadoop local mode or for some special types of queries).

          Problems

          With the above arrangement - each (non local-mode) Hadoop job will upload all the required jar files into HDFS. TaskTrackers will download these jars from HDFS (at most once per job) and check modification times of downloaded files (second task onwards). The following overheads are observed:

          • Job submission latency is impacted because of the need to serially upload multiple jar files into HDFS. At Facebook - we typically see 5-6 seconds of pause in this stage (depends on how responsive DFS is on a given day)
          • There is some latency in setting up the first task as resources must be downloaded from HDFS. We have typically observed this to be around 2-3 seconds at Facebook.
          • For subsequent tasks - the latency impact is not as high - but the mtime check adds to general Namenode pressure.

          Observations

          • jars and other resources are shared across different jobs and users. there are, in fact, hardly any resources that are not shared.
          • these resources are meant to be immutable

          We would like to use these properties to solve some of the overheads in the current protocol while retaining the simplicity of the deployment model that exists today.

          General Approach

          We would like to introduce (for lack of a better term) the notion of Content Addressable Resources (CAR) that are stored in a central repository in HDFS:

          1. CAR jars/files/archives are identified by their content (for example - named using their md5 checksum).

            This allows different jobs to share resources. Each Job can find out whether the resources required by it are already available in HDFS (by comparing the md5 signatures of their resources against the contents in the CAR repository).

          2. Content Addressable resources (once uploaded) are immutable. They can only be garbage collected (by server side daemons).
            This allows TaskTrackers to skip mtime checks on such resources.

          The CAR functionality is exposed to clients in two ways:

          • a boolean configuration option (defaulting to false) to indicate that resources added via -libjars/files/archives options are content addressable
          • enhancing the Distributed Cache api to mark specific files/archives as CAR (similar to how specific files can be marked public)

          Protocol

          Assume a jobclient has a CAR file F on local disk to be uploaded for task execution. Here's approximately a trace of what happens from the beginning of the job to its end:

          1. Client computes the md5 signature of F (= md5_F)
            • One can additionally provide an option to skip this step - the md5 can be precomputed and stored in a file named F.md5 alongside F. The client can look for and use the contents of this file as the md5 sum.

          2. The client fetches (in a single filesystem call) the list of md5 signatures (and their 'atime' attribute among other things) of the CAR repository

          3. If the md5_F already exists in the CAR repository - then the client simply uses the URI of the existing copy as the resource to be downloaded on the TaskTrackers
            • If the atime of md5_F is older than 1 day, then the client updates the atime (See #6)

          4. If md5_F does not exist in the CAR repository then the client uploads it to the CAR repository using md5_F as the name (a code sketch of steps 1-4 follows this list)

          5. The TaskTracker, on being requested to run a task requiring CAR resource md5_F, checks whether md5_F is localized.
            • If md5_F is already localized - then nothing more needs to be done. the localized version is used by the Task
            • If md5_F is not localized - then it's fetched from the CAR repository

          6. A garbage collector (running on the server side - preferably the JT) scans the CAR repository periodically, looking for and deleting resources whose atime is older than N days (a sketch of this sweep appears after the summary paragraph below). This is similar to the TrashEmptier in the Namenode.

          7. The number N is configurable. The protocol guarantees that no job less than N-1 days in length will have its resources garbage collected before it finishes (because of the update-atime step in #3). In practice, the total size of the CAR repository is likely to be very small (relative to other contents in HDFS) and N can be set to a very high number.
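          A minimal client-side sketch of steps 1-4, assuming the CAR repository is a flat HDFS directory in which each resource is stored under its md5 hex string. The directory name, class name and the md5 argument (which could be produced by something like the ContentSignature example in the description) are illustrative and are not the names used by the attached patch.

            import java.io.IOException;
            import java.util.HashMap;
            import java.util.Map;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class CarClient {
              private static final Path CAR_DIR = new Path("/mapred/car");   // hypothetical location
              private static final long ONE_DAY_MS = 24L * 60 * 60 * 1000;

              /** Returns the HDFS path to use for a local resource, uploading it only when needed. */
              public static Path resolve(FileSystem fs, Path localFile, String md5) throws IOException {
                // Step 2: a single listStatus() call fetches every signature and its atime.
                Map<String, FileStatus> repo = new HashMap<String, FileStatus>();
                for (FileStatus st : fs.listStatus(CAR_DIR)) {
                  repo.put(st.getPath().getName(), st);
                }
                Path shared = new Path(CAR_DIR, md5);
                FileStatus existing = repo.get(md5);
                if (existing != null) {
                  // Step 3: reuse the existing copy; refresh atime if it is getting stale so the
                  // garbage collector (step 6) does not remove it while this job is still running.
                  if (System.currentTimeMillis() - existing.getAccessTime() > ONE_DAY_MS) {
                    fs.setTimes(shared, -1, System.currentTimeMillis());   // -1 leaves mtime untouched
                  }
                  return shared;
                }
                // Step 4: the first client to see this content uploads it under its signature.
                fs.copyFromLocalFile(localFile, shared);
                return shared;
              }

              public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                // args: <local file> <precomputed md5 hex>
                System.out.println(resolve(fs, new Path(args[0]), args[1]));
              }
            }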

          In this protocol - assuming that most jobs are using the same resources - the vast majority of job submissions make only one file system call (to list the CAR repository on the job client). Most task executions do not require any calls to the file system (for purposes of localization). Note that uploads to the CAR repository will also be rare (in steady state).
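          And a matching sketch of the server-side garbage collector from steps 6-7, using the same hypothetical flat directory as the client sketch above; the retention period and scan trigger are illustrative (the real daemon would presumably be configured and would run inside the JT, similar to the TrashEmptier):

            import java.io.IOException;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class CarGarbageCollector {
              /** Deletes CAR entries whose atime is older than retentionDays. */
              public static void sweep(FileSystem fs, Path carDir, int retentionDays) throws IOException {
                long cutoff = System.currentTimeMillis() - retentionDays * 24L * 60 * 60 * 1000;
                for (FileStatus st : fs.listStatus(carDir)) {
                  if (st.getAccessTime() < cutoff) {
                    // Clients bump atime at most once a day (step 3), so anything older than the
                    // cutoff has not been claimed by a recently submitted job.
                    fs.delete(st.getPath(), true);
                  }
                }
              }

              public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                sweep(fs, new Path("/mapred/car"), 7);   // hypothetical directory and N
              }
            }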

          Notes

          1. The garbage collection of localized resources on TaskTrackers happens the same as today (for resources downloaded via the distributed cache). In particular, no synchronization is required between garbage collection of localized resources and those of the backing URIs in hdfs.

          2. In step #4 - in the v1 implementation, the client is responsible for computing the md5. If the client is malicious - it can spoof the md5 (of important jars) and upload malicious code thereby affecting the execution of other clients.

          3. In the v1 implementation - the CAR repository is implemented as a fixed directory in HDFS. The clients must have write permission to the CAR directory (to upload new resources into it). A malicious client can then delete or modify resources before they are eligible for garbage collection - potentially affecting running jobs.

          The latter two issues can be solved by having a server side agent control the addition and deletion of resources to the CAR repository. However this has not been implemented in v1. The initial implementation only suffices for environments that can make the assumption of non-malicious clients - but can be extended to cover more security conscious use cases in the future (with the attendant burden of more server side apis).

          Koji Noguchi added a comment -

          For me, that's not of a worry. It may delay individual job submissions, but the overall load to the hdfs isn't much.

          (at least compared to later phase of hundreds and thousands of tasktrackers looking up mtime of 'all those jars'.)

          Since my problem is just about lookup of mtime, created a new jira MAPREDUCE-2011.

          Joydeep Sen Sarma added a comment -

          > if the jar changed resulting in different tasks of the same job getting different data rendering debugging impossible!

          prior to security related functionality (and even with it if the user is capricious enough) - i think this was very much possible. the only contract that the TT's seem to follow today is that the launched task sees the latest possible version of a required resource (jar/file etc.). there's no contract that says that each task sees identical version of each resource. the mtime check only seems to reinforce this notion - that resources could be changing underneath as the job is running.

          that said - there will likely be security holes in our current scheme. we will post a spec and trunk patch by sometime later today (or weekend) hopefully.

          Arun C Murthy added a comment -

          > sorry for the endless confusion - i will try to write up a detailed doc tomorrow covering use cases and design/gaps etc.

          Can you please attach one?

          > the changes to distributed cache (of which there are little - i think most changes are in jobclient and taskrunner) are concerned with making the assumption that the shared objects are immutable (in which case mtime checks can be bypassed).

          This seems just wrong - assuming immutability is just not useful - the cost of the mtime is trivial compared to doing the I/O which we are anyway saving. It will essentially introduce randomness if the jar changed resulting in different tasks of the same job getting different data rendering debugging impossible!

          -1

          Koji Noguchi added a comment -

          > u would get a trace of all those jars being uploaded to hdfs. it's ridiculous.

          For me, that's not of a worry. It may delay individual job submissions, but the overall load to the hdfs isn't much.
          (at least compared to later phase of hundreds and thousands of tasktrackers looking up mtime of 'all those jars'.)

          Joydeep Sen Sarma added a comment -

          it's almost in production - the patch posted here had some bugs. we have final patch for 20 available - but have been trying to get the one for trunk into shape. should post soon.

          try running hadoop in debug log4j mode. u would get a trace of all those jars being uploaded to hdfs. it's ridiculous.

          Koji Noguchi added a comment -

          > we have started testing this patch internally and this would become production in a couple of weeks.

          Joydeep, is this being tested on your production cluster? How does the load look?
          I don't know the details, but I like the "part of the goal here is to not have to look up mtimes again and again. " part.

          We certainly have applications with many small tasks having multiple libjar/distributed-caches resulting with too many getfileinfo calls to the namenode.

          Joydeep Sen Sarma added a comment -

          sorry for the endless confusion - i will try to write up a detailed doc tomorrow covering use cases and design/gaps etc.

          the use case involves libjars being added from local file systems (since that's where software packages are deployed). it's really not possible to deploy software packages on hdfs (in certain cases - we wish to execute the software locally without interacting with hdfs entirely (see for example HIVE-1408)).

          the changes to distributed cache (of which there are little - i think most changes are in jobclient and taskrunner) are concerned with making the assumption that the shared objects are immutable (in which case mtime checks can be bypassed).

          Amareshwari Sriramadasu added a comment -

          > Currently, files loaded through hadoop libjars/files/archives mechanism are copied onto HDFS and removed on every job.

          JobClient copies the files/archives passed through -files, -libjars and -archives options if they are in the local file system. If they are already in HDFS, no copying is done. Then, the hdfs paths are added to distributedCache. Thus, as Arun pointed out in an earlier comment, DistributedCache does not copy any files to dfs.

          I think this issue would involve picking up job jar from a dfs path directly instead of local fs.

          Junjie Liang added a comment -

          To supplement Joydeep's comment:

          We are trying to save the number of calls to the NameNode, through the following optimizations:

          1) Currently, files loaded through hadoop libjars/files/archives mechanism are copied onto HDFS and removed on every job. This is inefficient if most jobs are submitted from only 3-4 versions of hive, because rightfully the files should persist in HDFS to be reused. Hence the idea of decoupling files from their jobId to make them sharable across jobs.

          2) If files are identified with their md5 checksums, we no longer need to verify file modification time in the TT. This saves another call to the NameNode to get the FileStatus object.

          The reduction in the number of calls to the NameNode is small, but over a large number of jobs we believe it will be a noticeable difference.

          Joydeep Sen Sarma added a comment -

          > The DistributedCache already tracks mtimes for files

          ummmm - that's what i am saying. if u consider objects as immutable - then u don't have to track and look up mtimes. part of the goal here is to not have to look up mtimes again and again. if u have an object with matching md5 localized - you are done. (but we can't use the names alone for that. names can collide. md5 cannot (or nearly so). so we name objects based on their content signature (md5) - which is what a content addressible store/cache does).

          > Admin installs pig/hive on hdfs:
          > /share/hive/v1/hive.jar
          > /share/hive/v2/hive.jar

          that's not how hive works (or how hadoop streaming works). people deploy hive on NFS filers or local disks. users run hive jobs from these installation points. there's no hdfs involvement anywhere. people add jars to hive or hadoop streaming from their personal or shared folders. when people run hive jobs - they are not writing java. there's no .setRemoteJar() code they are writing.

          hive loads the required jars (from the install directory) to hadoop via hadoop libjars/files/archives functionality. different hive clients are not aware of each other (ditto for hadoop streaming). most of the hive clients are running from common install points - but people may be running from personal install points with altered builds.

          with what we have done in this patch - all these uncoordinated clients automatically share jars with each other. because the name for the shared object now is derived from the content of the object. we are still leveraging distributed cache - but we are naming objects based on their contents. Junjie tells me we can leverage the 'shared' objects namespace from trunk (in 20 we added our own shared namespace).

          because the names are based on strong content signature - we can make the assumption of immutability. as i have tried to point out many times - when objects are immutable - one can make optimizations and skip timestamp based validation. the latter requires hdfs lookups and creates load and latency.

          note that we need zero application changes for this sharing and zero admin overhead. so all sorts of hadoop users will automatically start getting the benefit of shared jars without writing any code and without any special admin recipe.

          isn't that good?

          Arun C Murthy added a comment -

          Joydeep - Maybe we are talking past each other, yet ...

          The DistributedCache already tracks mtimes for files. Each TT, via the DistributedCache, localizes the file based on <absolute-path-on-hdfs, mtime>.

          This seems sufficient for the use case as I understand it.

          Here is the flow:

          Admin installs pig/hive on hdfs:
          /share/hive/v1/hive.jar
          /share/hive/v2/hive.jar

          The pig/hive framework, in fact, any MR job then does:

          JobConf job = new JobConf();
          job.setRemoteJar(new Path("/share/hive/v1/hive.jar"));  // setRemoteJar is the API proposed here
          JobClient.runJob(job);

          That's it. The JobClient has the smarts to use DistributedCache.addArchiveToClassPath as the implementation of JobConf.setRemoteJar.

          If you want a new version of hive.jar, you change hive to use /share/hive/v2/hive.jar.

          What am I missing here?

          Joydeep Sen Sarma added a comment -

          @Arun - you are right - this is a layer above distributed cache for the most part. Take a look at our use case (bottom of my previous comments). Essentially we are extending the Distributed Cache a bit to be a content addressible cache. I do not think our use case is directly supported by Hadoop for this purpose - and we are hoping to make the change in the framework (instead of Hive) because there's nothing Hive specific here and whatever we are doing will be directly leveraged by other apps.

          Sharing != Content addressible. A NFS filer can be globally shared - but it's not content addressible. An EMC Centera (amongst others) is. Sorry - terrible examples - trying to come up with something quickly.

          Will address Vinod's comments later - we have taken race considerations into account.

          Arun C Murthy added a comment -

          To reiterate:
          Pre-security - Artifacts in DistributedCache are already shared across jobs, no changes needed.
          Post-security - MAPREDUCE-744 allows for a shared distributed cache across jobs too.

          Arun C Murthy added a comment -

          > I'm proposing a change in the way files are stored in HDFS. Instead of storing files in /jobid/files or /jobid/archives, we store them directly in {mapred.system.dir}/files and {mapred.system.dir}/archives. This removes the association between a file and the job ID, so that files can be persistent across jobs.

          I'm confused here. The distributed-cache does not write any files to HDFS, it merely is configured with a set of files to be copied from HDFS to the compute node. Why are we making these changes?

          Vinod Kumar Vavilapalli added a comment -

          > Currently, auxiliary files added through DistributedCache.addCacheFiles and DistributedCache.addCacheArchive end up in {mapred.system.dir}/job_id/files or {mapred.system.dir}/job_id/archives. The /job_id directory is then removed after every job, which is why files cannot be reused across jobs.

          That is only true for private distributed cache files. Artifacts which are already public on the DFS don't go to mapredsystem directly at all and are reusable across users/jobs.

          > 2. it treats shared objects as immutable. meaning that we never look up the timestamp of the backing object in hdfs during task localization/validation. this saves time during task setup.

          > 3. reasonable effort has been put to bypass as many hdfs calls as possible in step 1. the client gets a listing of all shared objects and their md5 signatures in one shot. because of the immutability assumption - individual file stamps are never required and save hdfs calls.

          I think this is orthogonal. If md5 checksums are preferred over timestamp based checks for the sake of lessening DFS accesses, that can be done separately within the current design, no? Distributed cache files originally did rely on the md5 checksum of the files/jars that HDFS itself used to have. However that changed via HADOOP-1084 when checksums paved the way for block-level crcs.

          > 4. finally - there is inbuilt code to do garbage collection of the shared namespace (in hdfs) by deleting old shared objects that have not been recently accessed.

          This is where I think it gets tricky. First, garbage collection of the dfs namespace should be accompanied by the same on individual TTs - more complexity.

          There are race conditions too. It's not clear how the JobTracker is prevented from expiring shared cache files/jars when some JobClient has already marked or is in the process of marking those artifacts for usage by the job. Warranting such synchronization across JobTracker and JobClients is difficult and, at best, brittle. Leaving the synchronization issues unsolved would only mean leaving the tasks/job to fail later which is not desirable.

          > the difference here is that all applications (like Hive) using libjars etc. options provided in hadoop automatically share jars with each other (when they set this option). the applications don't have to do anything special (like figuring out the right global identifier in hdfs for their jars).

          That seems like a valid use-case. But as I mentioned above, because of complexity and race conditions it seems like a wrong place to develop it.

          I think the core problem is trying to perform a service (sharing of files) that strictly belongs to the layer above mapreduce - maintaining the share list doesn't seem like a JT's responsibility. The current way of leaving it to the users to decide which are public files (and hence shareable) and which are not, and how and when they are purged, keeps things saner from the mapreduce framework point of view. What do you think?

          > if u can look at the patch a bit - that might help understand the differences as well

          I looked at the patch. And I am still not convinced. Yet, that is.

          Joydeep Sen Sarma added a comment -

          thanks for taking a look. i think there are some differences (and potentially some overlap as well) with what we are trying to do here:

          1. the jobclient in this approach computes md5 of jars/files/archives (when a special option is enabled) and then automatically submits these jars as shared objects by putting them in a global namespace - where the (md5, file-name) identifies the shared object (instead of the (jobid, file-name, file-timestamp)).

          2. it treats shared objects as immutable. meaning that we never look up the timestamp of the backing object in hdfs during task localization/validation. this saves time during task setup.

          3. reasonable effort has been put to bypass as many hdfs calls as possible in step 1. the client gets a listing of all shared objects and their md5 signatures in one shot. because of the immutability assumption - individual file stamps are never required and save hdfs calls.

          4. finally - there is inbuilt code to do garbage collection of the shared namespace (in hdfs) by deleting old shared objects that have not been recently accessed.

          so i believe the scope of this effort is somewhat different (based on looking at the last patch for 744).

          the difference here is that all applications (like Hive) using libjars etc. options provided in hadoop automatically share jars with each other (when they set this option). the applications don't have to do anything special (like figuring out the right global identifier in hdfs for their jars).

          Our primary use case is for Hive. Hive submits multiple jars for each Hadoop job. Users can add more. At any given time - we have at least 4-5 official versions of Hive being used to submit jobs. in addition - hive developers are developing custom builds and submitting jobs using them. total jobs submitted per day is tens of thousands.

          with this patch - we automatically get sharing of jars and zero administration overhead of managing a global namespace amongst many versions of our software libraries. I believe there's nothing Hive specific here. We use hadoop jar/file resources just like hadoop-streaming and other map-reduce jobs.

          before embarking on this venture - we looked at the hadoop code and tried to find out whether a similar facility existed. we noticed a md5 class - but no uses for it. if there is existing functionality to the above effect - we would love to pick it up (less work for us). otherwise - i think this is very useful functionality that would be good to have in Hadoop framework.

          If you can look at the patch a bit, that might help clarify the differences as well.

          Vinod Kumar Vavilapalli added a comment -

          Apologies for not looking at this issue before.

          The distributed cache already has support for sharing files/archives via MAPREDUCE-744. It went into 0.21; maybe all you need is a back-port.

          The requirements for this issue can be met simply by making the job jar files on DFS public and adding them to the distributed cache as files/archives to be put on the task's classpath. I don't see anything else needed.
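
          For reference, this existing-API route would look roughly like the following at job-setup time; the DFS paths are hypothetical, and the jars must already sit in a world-readable location for the MAPREDUCE-744 public sharing to apply.

          import java.net.URI;

          import org.apache.hadoop.filecache.DistributedCache;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.mapred.JobConf;

          public class PublicCacheSetup {
            public static void addSharedJars(JobConf conf) throws Exception {
              // Jars uploaded once to a world-readable DFS location (paths are hypothetical).
              DistributedCache.addFileToClassPath(new Path("/shared/lib/hive-exec.jar"), conf);
              DistributedCache.addArchiveToClassPath(new Path("/shared/lib/aux-udfs.zip"), conf);

              // A plain auxiliary file, localized on each node but not put on the classpath.
              DistributedCache.addCacheFile(new URI("hdfs:///shared/conf/lookup.txt"), conf);
            }
          }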

          Joydeep Sen Sarma added a comment -

          Arun and other Hadoop'ers: it might take JJ some time to get the patch for trunk ready. If you have some cycles, it would be good to vet the general approach by looking at the patch for 0.20. From a quick glance, I think the trunk code differs primarily in security-related aspects.

          We have started testing this patch internally, and it should be in production in a couple of weeks.

          Junjie Liang added a comment -

          Patch for version 0.20.2
          =================

          Set "mapred.cache.shared.enabled" to "true" to enable cache files to be shared across jobs.

          Junjie Liang added a comment -

          Here are some more details on how I intend to make the changes. Please comment and give suggestions as you see fit.

          Currently, auxiliary files added through DistributedCache.addCacheFiles and DistributedCache.addCacheArchive end up in {mapred.system.dir}/job_id/files or {mapred.system.dir}/job_id/archives. The /job_id directory is then removed after every job, which is why files cannot be reused across jobs.

          I'm proposing a change in the way files are stored in HDFS. Instead of storing files in /job_id/files or /job_id/archives, we store them directly in {mapred.system.dir}/files and {mapred.system.dir}/archives. This removes the association between a file and the job ID, so that files can be persistent across jobs.

          Two new function calls, DistributedCache.addSharedCacheFiles() and DistributedCache.addSharedCacheArchives(), are added for users to add files that can be shared across jobs. Files that are added through the original functions addCacheFiles() and addCacheArchives() are not affected; they go through the same code path as before.
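
          Assuming the new calls mirror the shape of the existing array-based setCacheFiles()/setCacheArchives() helpers (that is an assumption on my part; the patch defines the actual argument types), client usage might look like:

          import java.net.URI;

          import org.apache.hadoop.filecache.DistributedCache;
          import org.apache.hadoop.mapred.JobConf;

          public class SharedCacheApiSketch {
            public static void configure(JobConf conf) throws Exception {
              // Hypothetical signatures: shared artifacts, keyed by (md5, filename), reused across jobs.
              DistributedCache.addSharedCacheFiles(
                  new URI[] { new URI("file:///opt/hive/lib/hive-exec.jar") }, conf);
              DistributedCache.addSharedCacheArchives(
                  new URI[] { new URI("file:///opt/hive/aux/udfs.zip") }, conf);

              // Unchanged, per-job code path.
              DistributedCache.addCacheFile(new URI("hdfs:///user/joe/lookup.txt"), conf);
            }
          }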

          The "shared" files are stored in

          {mapred.system.dir}/files and {mapred.system.dir}

          /archives (note the job_id is removed from the path). To prevent files with the same filename from colliding, a prefix which is the md5 of the file is added to the filename of each file, so for example, test.txt becomes ab876d86389d76c9e692fffd51bb2acde_test.txt. We use both the md5 checksum and filename to identify a file so there is no confusion between files with the same filename but have different contents, and files with the same contents but with different filenames.

          The TaskRunner no longer needs to use timestamps to decide whether a file is up to date, since the file will have a different md5 checksum if it is modified.
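
          Under that scheme, the TaskTracker-side freshness check could reduce to a simple name lookup. A minimal sketch, assuming a flat local cache directory and hypothetical helper names (not the patch's actual code):

          import java.io.File;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class SharedLocalizationSketch {
            // Localize e.g. "ab876..._test.txt"; the md5 in the name makes timestamp checks unnecessary.
            public static File localize(Configuration conf, Path sharedFile, File localCacheDir)
                throws Exception {
              File local = new File(localCacheDir, sharedFile.getName());
              if (local.exists()) {
                return local;  // the name encodes the content hash, so an existing copy is current
              }
              FileSystem dfs = sharedFile.getFileSystem(conf);
              dfs.copyToLocalFile(sharedFile, new Path(local.getAbsolutePath()));
              return local;
            }
          }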

          Files that need to be changed: JobClient.java, DistributedCache.java, and TaskRunner.java have the most changes, since files move from the client to HDFS to the TaskTracker nodes through this code.

          Thanks!

          Joydeep Sen Sarma added a comment -

          It is certainly true that we could do this at the Hive layer. Two issues:

          • not generic (meaning it wouldn't work for streaming, for example)
          • we would need to repeat some of the classpath management that the JobClient/TT already take care of.

          Currently Hive leverages Hadoop-provided facilities for distributing jars and files, and we will try to extend this functionality.

          dhruba borthakur added a comment -

          +1

          From what I have learnt, the files in the distributed cache are persisted even across map-reduce jobs. So the Hive client can upload the relevant jars to some location in HDFS and then point the distributed cache to those HDFS URIs. If we do that, the TT will download those URIs to local disk only once, and all tasks (across multiple jobs) on that TaskTracker will continue to use these jars.
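
          That flow, using today's API and a hypothetical well-known HDFS location managed by the Hive client, would be roughly:

          import java.net.URI;

          import org.apache.hadoop.filecache.DistributedCache;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.mapred.JobConf;

          public class HiveJarUploadSketch {
            public static void attachJar(JobConf conf, Path localJar) throws Exception {
              // One-time upload to a fixed, hypothetical HDFS location.
              Path target = new Path("/user/hive/lib/" + localJar.getName());
              FileSystem dfs = target.getFileSystem(conf);
              if (!dfs.exists(target)) {
                dfs.copyFromLocalFile(localJar, target);
              }
              // Every subsequent job just references the HDFS URI; each TT downloads it only once.
              DistributedCache.addCacheArchive(new URI(target.toString()), conf);
            }
          }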

          Joydeep Sen Sarma added a comment -

          Yeah, we have an intern (Junjie Liang) working on this, and he is reusing the DistributedCache; he should be posting some code/design soon.

          Arun C Murthy added a comment -

          +1

          A straightforward way is to use the DistributedCache directly; an easy change is to have the JobSubmissionProtocol use either a custom jar (as today) or just refer to the jars in the DistributedCache.


            People

            • Assignee:
              Unassigned
            • Reporter:
              Joydeep Sen Sarma
            • Votes:
              0
            • Watchers:
              29
