Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 0.20.203.0
Fix Version/s: None
Component/s: None
Description
We put files into the Distributed Cache, but sometimes only the map tasks or only the reduce tasks use a cached file, not both. The TaskTracker nevertheless always downloads every cached file from HDFS, so if the cache contains somewhat large files this wastes a lot of time.
This patch therefore adds some new APIs to DistributedCache.java, as follows (a hedged sketch of one variant appears after the list):
addArchiveToClassPathForMap
addArchiveToClassPathForReduce
addFileToClassPathForMap
addFileToClassPathForReduce
addCacheFileForMap
addCacheFileForReduce
addCacheArchiveForMap
addCacheArchiveForReduce
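The patch itself is not shown here, but a minimal sketch of how one such method could be implemented is below. The configuration key mapred.cache.files.map is an illustrative assumption, not necessarily the property name the patch actually uses:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;

public class DistributedCacheSketch {
  // Hypothetical property name, assumed for illustration only.
  private static final String MAP_CACHE_FILES = "mapred.cache.files.map";

  /** Registers a file to be localized for map tasks only. */
  public static void addCacheFileForMap(URI uri, Configuration conf) {
    // Append the URI to a map-only list so the TaskTracker can
    // skip localizing it for reduce tasks.
    String files = conf.get(MAP_CACHE_FILES);
    conf.set(MAP_CACHE_FILES,
        files == null ? uri.toString() : files + "," + uri);
  }
}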
The new APIs do not affect the original interface. Users can access these features in either of the following two ways:
1)
hadoop job **** -files file1 -mapfiles file2 -reducefiles file3 -archives arc1 -maparchives arc2 -reducearchives arc3
2)
DistributedCache.addCacheFile(file1, conf);
DistributedCache.addCacheFileForMap(file2, conf);
DistributedCache.addCacheFileForReduce(file3, conf);
DistributedCache.addCacheArchive(arc1, conf);
DistributedCache.addCacheArchiveForMap(arc2, conf);
DistributedCache.addCacheArchiveForReduce(arc3, conf);
These two methods produce the same result; that is, you put six files into the distributed cache (file1 ~ file3, arc1 ~ arc3), but:
file1 and arc1 are cached for both map and reduce;
file2 and arc2 are cached only for map;
file3 and arc3 are cached only for reduce.
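For context, the saving on the TaskTracker side amounts to a simple scope check before localizing each entry. A rough, self-contained sketch; the Scope enum and method name are assumptions for illustration, not the patch's actual code:

public class CacheScopeSketch {
  enum Scope { BOTH, MAP, REDUCE }

  // Returns true if this task should download the cache entry.
  // Entries scoped to the other phase are skipped, avoiding the
  // unnecessary HDFS download described above.
  static boolean shouldLocalize(Scope scope, boolean isMapTask) {
    if (scope == Scope.BOTH) {
      return true;
    }
    return isMapTask ? scope == Scope.MAP : scope == Scope.REDUCE;
  }
}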