[HADOOP-1032] Support for caching Job JARs - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.11.2
Fix Version/s: 0.12.0
Component/s: None
Labels: None

Description

Jobs often need to be rerun a number of times, for example a job that reads from crawled data again and again, so having to upload the job jars to every node on each run is cumbersome. We need a caching mechanism to improve performance. Here are the proposed features for job-specific caching of jar/conf files:

Ability to resubmit jobs with jars without having to propagate the same jar to all nodes.
The idea is to keep a store (at a path specified by the user in job.xml?) local to the task node, so as to speed up task initiation on tasktrackers. This assumes that the jar does not change during an MR task.

An independent DFS store to upload jars to (a Distributed File Cache?) that is not cleaned up between jobs.
This might need user-level configuration to tell the jobclient to upload files to the DFSCache instead of the DFS. https://issues.apache.org/jira/browse/HADOOP-288 facilitates this. Our local cache can be a client of the DFS Cache.

A standard cache mechanism that checks the local store for changes and fetches from the DFS if the local copy is found dirty.
This does away with versioning. The DFSCache supports an MD5 checksum check, which we can use.
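
To make the dirty check concrete, here is a minimal sketch in Python rather than Hadoop's own Java. It does not use any Hadoop or DFSCache API: `md5_of`, `fetch_jar`, the cache directory, and the use of an ordinary file path to stand in for the jar held in the DFS-side cache are all hypothetical names and assumptions made for illustration. The local copy is reused only when its MD5 matches the checksum reported for the remote jar; otherwise it is treated as dirty and copied again.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Tuple


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_jar(remote_jar: Path, remote_md5: str, cache_dir: Path) -> Tuple[Path, bool]:
    """Return (local_path, copied) for a job jar.

    remote_jar stands in for the jar in the DFS-side cache and remote_md5
    for the checksum that store would report; both are assumptions.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_jar = cache_dir / remote_jar.name

    # Cache hit: a local copy exists and its checksum matches the remote one.
    if local_jar.exists() and md5_of(local_jar) == remote_md5:
        return local_jar, False

    # Missing or dirty: replace the local copy with the remote jar.
    shutil.copyfile(remote_jar, local_jar)
    return local_jar, True
```

In this sketch a task node would call `fetch_jar` once per task initiation. The second element of the result says whether a copy was needed, so a resubmitted job with an unchanged jar pays only for one local checksum computation, and no version numbers have to be tracked.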

Anything else? Suggestions? Thoughts?

Attachments

HADOOP-1032_2.patch | 28/Feb/07 13:42 | 3 kB | Gautam Kowshik
HADOOP-1032_3.patch | 28/Feb/07 17:09 | 2 kB | Gautam Kowshik
HADOOP-1032_4.patch | 01/Mar/07 11:20 | 5 kB | Gautam Kowshik
HADOOP-1032.patch | 27/Feb/07 18:37 | 3 kB | Gautam Kowshik

People

Assignee: Gautam Kowshik

Reporter: Gautam Kowshik

Votes: 0

Watchers: 0

Dates

Created: 23/Feb/07 07:51

Updated: 08/Jul/09 16:52

Resolved: 02/Mar/07 20:04