Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 0.20.203.0
Fix Version/s: None
Component/s: None
Description
We put files into the Distributed Cache, but sometimes only the map tasks or only the reduce tasks use a cached file, not both. The TaskTracker nevertheless always downloads every cached file from HDFS, so if the cache contains somewhat large files this wastes a lot of time.
This patch therefore adds some new APIs to DistributedCache.java, as follows (a hedged sketch of one variant appears after the list):
addArchiveToClassPathForMap
addArchiveToClassPathForReduce
addFileToClassPathForMap
addFileToClassPathForReduce
addCacheFileForMap
addCacheFileForReduce
addCacheArchiveForMap
addCacheArchiveForReduce
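The patch itself is not shown here, but a minimal sketch of how one such method could be implemented is below. The configuration key mapred.cache.files.map is an illustrative assumption, not necessarily the property name the patch actually uses:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;

public class DistributedCacheSketch {
  // Hypothetical property name, assumed for illustration only.
  private static final String MAP_CACHE_FILES = "mapred.cache.files.map";

  /** Registers a file to be localized for map tasks only. */
  public static void addCacheFileForMap(URI uri, Configuration conf) {
    // Append the URI to a map-only list so the TaskTracker can
    // skip localizing it for reduce tasks.
    String files = conf.get(MAP_CACHE_FILES);
    conf.set(MAP_CACHE_FILES,
        files == null ? uri.toString() : files + "," + uri);
  }
}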
The new APIs do not affect the original interface. Users can access these features in either of the following two ways:
1)
hadoop job **** -files file1 -mapfiles file2 -reducefiles file3 -archives arc1 -maparchives arc2 -reducearchives arc3
2)
DistributedCache.addCacheFile(file1, conf);
DistributedCache.addCacheFileForMap(file2, conf);
DistributedCache.addCacheFileForReduce(file3, conf);
DistributedCache.addCacheArchive(arc1, conf);
DistributedCache.addCacheArchiveForMap(arc2, conf);
DistributedCache.addCacheArchiveForReduce(arc3, conf);
These two methods produce the same result; that is, you put six files into the distributed cache (file1 ~ file3, arc1 ~ arc3), but:
file1 and arc1 are cached for both map and reduce;
file2 and arc2 are cached only for map;
file3 and arc3 are cached only for reduce.
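For context, the saving on the TaskTracker side amounts to a simple scope check before localizing each entry. A rough, self-contained sketch; the Scope enum and method name are assumptions for illustration, not the patch's actual code:

public class CacheScopeSketch {
  enum Scope { BOTH, MAP, REDUCE }

  // Returns true if this task should download the cache entry.
  // Entries scoped to the other phase are skipped, avoiding the
  // unnecessary HDFS download described above.
  static boolean shouldLocalize(Scope scope, boolean isMapTask) {
    if (scope == Scope.BOTH) {
      return true;
    }
    return isMapTask ? scope == Scope.MAP : scope == Scope.REDUCE;
  }
}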