Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
Description
Currently, each Hadoop job uploads its required resources (jars/files/archives) to a new location in HDFS. The map-reduce nodes involved in executing the job then download these resources to local disk.
In an environment where most users rely on a standard set of jars and files (for example, because they use a framework like Hive or Pig), the same jars are uploaded and downloaded repeatedly. The overhead of this protocol (primarily in terms of end-user latency) is significant when:
- the jobs are small (and, conversely, large in number)
- the Namenode is under load (meaning HDFS latencies are high and made worse, in part, by this protocol)
Hadoop should provide a way for jobs in a cooperative environment to avoid submitting the same files over and over again. Identifying and caching execution resources by a content signature (md5/sha) would be a good alternative to have available.
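To make the proposal concrete, here is a minimal sketch (in Java, against the public Hadoop FileSystem API) of what content-signature-based caching could look like on the submitting client's side: compute a SHA-256 digest of the local jar and copy it into a shared HDFS directory only if no entry with that digest already exists. The SharedResourceUploader class, the /shared-cache root, and the uploadIfAbsent helper are hypothetical illustrations of the idea, not existing Hadoop code.
{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch of content-addressed uploads: a resource is keyed by its
 * SHA-256 digest, so a second job referencing the same jar finds it already
 * cached in HDFS and skips the upload. Names below are illustrative only.
 */
public class SharedResourceUploader {

  // Hypothetical shared cache root in HDFS; not an existing Hadoop path.
  private static final String CACHE_ROOT = "/shared-cache";

  /** Returns the HDFS path of the cached resource, uploading it only if missing. */
  public static Path uploadIfAbsent(Configuration conf, java.nio.file.Path localJar)
      throws Exception {
    String digest = sha256Hex(localJar);
    FileSystem fs = FileSystem.get(conf);
    Path cached = new Path(CACHE_ROOT, digest + "/" + localJar.getFileName());

    if (!fs.exists(cached)) {
      // The first job referencing this content pays the upload cost; later jobs reuse it.
      fs.copyFromLocalFile(new Path(localJar.toUri()), cached);
    }
    return cached;
  }

  // Computes the hex-encoded SHA-256 digest of a local file.
  private static String sha256Hex(java.nio.file.Path file) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    try (InputStream in = Files.newInputStream(file)) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n);
      }
    }
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest()) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }
}
{code}
The same digest could also serve as the key for node-local download caching, so tasks on a node that has already localized a jar with that signature would not need to fetch it from HDFS again.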
Attachments
Issue Links
- is duplicated by
  - YARN-1492 truly shared cache for jars (jobjar/libjar) (Resolved)
- relates to
  - MAPREDUCE-1902 job jar file is not distributed via DistributedCache (Open)