  1. Hadoop Map/Reduce
  2. MAPREDUCE-3323

Add new Distributed Cache interfaces that apply to Map or Reduce only, not both.


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.20.203.0
    • None
    • None
      Tested as follows:

      1: Add cache files for map/reduce.
      2: Get the cache files in the configure method of the map and reduce tasks,
          then print a message saying whether each task can get the cache file.

      3: Three test cases: cache for both map and reduce, cache for map only, cache for reduce only.
          In the first case, both map and reduce tasks can get local files from the distributed cache.
          In the second case, map tasks can get local files from the distributed cache, but reduce tasks cannot.
          In the third case, reduce tasks can get local files from the distributed cache, but map tasks cannot.

      Conclusion: it works as expected.
       

    Description

      We put files into the Distributed Cache, but sometimes only the map tasks or only the reduce tasks use these cached files; they are not needed by both. The TaskTracker nevertheless always downloads all cached files from HDFS, so if the cache contains large files, this costs a lot of time.

      This patch therefore adds the following new APIs to DistributedCache.java:

      addArchiveToClassPathForMap
      addArchiveToClassPathForReduce

      addFileToClassPathForMap
      addFileToClassPathForReduce

      addCacheFileForMap
      addCacheFileForReduce

      addCacheArchiveForMap
      addCacheArchiveForReduce
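
      As a rough illustration of what such methods might do, the sketch below records each URI under a scope-specific configuration key. It is not the patch itself: the class `ScopedCacheConfig`, the `Scope` enum, and the `.map` / `.reduce` key suffixes are hypothetical, and a plain `java.util.Properties` stands in for Hadoop's `Configuration`. Only the comma-separated `mapred.cache.files` key mirrors how the existing DistributedCache stores its entries.

```java
import java.net.URI;
import java.util.Properties;

/** Hypothetical sketch: scope-aware registration of cache files. */
final class ScopedCacheConfig {
    /** Which task type needs the file (illustrative, not a Hadoop type). */
    enum Scope { BOTH, MAP, REDUCE }

    // "mapred.cache.files" is the existing key; the suffixed keys are assumptions.
    static String keyFor(Scope scope) {
        switch (scope) {
            case MAP:    return "mapred.cache.files.map";
            case REDUCE: return "mapred.cache.files.reduce";
            default:     return "mapred.cache.files";
        }
    }

    /** Appends the URI to the comma-separated list stored under the scope's key. */
    static void addCacheFile(Properties conf, URI uri, Scope scope) {
        String key = keyFor(scope);
        String existing = conf.getProperty(key);
        conf.setProperty(key, existing == null ? uri.toString() : existing + "," + uri);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        addCacheFile(conf, URI.create("hdfs:///cache/file1"), Scope.BOTH);
        addCacheFile(conf, URI.create("hdfs:///cache/file2"), Scope.MAP);
        addCacheFile(conf, URI.create("hdfs:///cache/file3"), Scope.REDUCE);
        conf.list(System.out); // prints one key per scope
    }
}
```

      Keeping the shared key untouched is what would let the original interface behave exactly as before; only tasks that understand the extra keys would treat them specially.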

      The new API doesn't affect the original interface. Users can use these features in either of the following two ways:

      1)
      hadoop job **** -files file1 -mapfiles file2 -reducefiles file3 -archives arc1 -maparchives arc2 -reducearchives arc3

      2)
      DistributedCache.addCacheFile(conf, file1);
      DistributedCache.addCacheFileForMap(conf, file2);
      DistributedCache.addCacheFileForReduce(conf, file3);

      DistributedCache.addCacheArchive(conf, arc1);
      DistributedCache.addCacheArchiveForMap(conf, arc2);
      DistributedCache.addCacheArchiveForReduce(conf, arc3);

      These two methods have the same result. That is:

      You put six files into the distributed cache (file1 to file3 and arc1 to arc3):
      file1 and arc1 are cached for both map and reduce;
      file2 and arc2 are cached only for map;
      file3 and arc3 are cached only for reduce.
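
      The intended visibility can be sketched as a small filter over the registered entries. This is only an illustration under stated assumptions: `CacheVisibility`, `Scope`, `TaskType`, and `visibleTo` are hypothetical names, not part of Hadoop or of the patch, and plain strings stand in for cache URIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: which cache entries a given task type would localize. */
final class CacheVisibility {
    enum Scope { BOTH, MAP, REDUCE }
    enum TaskType { MAP, REDUCE }

    /** Returns the entries a task of the given type needs, in registration order. */
    static List<String> visibleTo(TaskType task, Map<String, Scope> entries) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Scope> e : entries.entrySet()) {
            Scope s = e.getValue();
            boolean wanted = s == Scope.BOTH
                    || (s == Scope.MAP && task == TaskType.MAP)
                    || (s == Scope.REDUCE && task == TaskType.REDUCE);
            if (wanted) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The six entries from the example above, with their scopes.
        Map<String, Scope> entries = new LinkedHashMap<>();
        entries.put("file1", Scope.BOTH);
        entries.put("file2", Scope.MAP);
        entries.put("file3", Scope.REDUCE);
        entries.put("arc1", Scope.BOTH);
        entries.put("arc2", Scope.MAP);
        entries.put("arc3", Scope.REDUCE);

        System.out.println("map:    " + visibleTo(TaskType.MAP, entries));
        System.out.println("reduce: " + visibleTo(TaskType.REDUCE, entries));
    }
}
```

      Run as written, this prints `map:    [file1, file2, arc1, arc2]` and `reduce: [file1, file3, arc1, arc3]`, so each task type skips the two entries registered only for the other.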

          People

            Assignee: Unassigned
            Reporter: fengdongyu@gmail.com Azuryy(Chijiong)
            Votes: 0
            Watchers: 8
