Attaching a patch that adds emulation of distributed cache load to gridmix simulated jobs.
The high-level details of what this patch does are:
(1) A new gridmix configuration property, "gridmix.distributed-cache-emulation.enable", is added; its default value is true. Setting it to false disables emulation of distributed cache load. Regardless of this property's setting, with the -generate option gridmix generates distributed cache files on HDFS.
Distributed cache emulation is disabled when '-' is given as the input trace (i.e. the trace is read from the stdin stream instead of a file).
Distributed cache emulation is disabled when <iopath> is on the local file system.
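
The conditions in (1) can be sketched in plain Java as below. This is only an illustration and not the code in the patch: the class name, method name and parameters are made up for the example, java.util.Properties stands in for the Hadoop Configuration, and <iopath> is assumed to be passed as a fully qualified URI (resolution of scheme-less paths against the default file system is not modeled).

```java
import java.net.URI;
import java.util.Properties;

// Hypothetical helper, not part of gridmix.
final class DistCacheEmulationCheck {
    // Property name and default value (true) as described in item (1).
    static final String ENABLE_KEY = "gridmix.distributed-cache-emulation.enable";

    // Returns true only if emulation is enabled in the config, the trace is
    // a file (not "-" for stdin) and ioPath is not on the local file system.
    static boolean shouldEmulate(Properties conf, String tracePath, URI ioPath) {
        boolean enabled = Boolean.parseBoolean(conf.getProperty(ENABLE_KEY, "true"));
        if (!enabled) {
            return false;
        }
        if ("-".equals(tracePath)) {
            return false; // trace comes from stdin
        }
        // Assumption: only the "file" scheme is treated as the local file system.
        return !"file".equalsIgnoreCase(ioPath.getScheme());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        URI hdfsIoPath = URI.create("hdfs://namenode:8020/gridmix/io");
        System.out.println(shouldEmulate(conf, "trace.json", hdfsIoPath));
        System.out.println(shouldEmulate(conf, "-", hdfsIoPath));
    }
}
```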
(2) The behavior of the -generate option is changed. -generate now means (a) generate input data in the directory <iopath>/input/ and (b) generate the distributed cache data needed to emulate the distributed cache load of this trace file in the directory <iopath>/distributedCache/.
For (a), the existing GenerateData MR job is used.
For (b), a new MR job, GenerateDistCacheData, is added; it runs after GenerateData and before the simulated jobs are submitted.
With the -generate option, (a) an existing <iopath>/input/ directory is an error, as in the current behavior, and (b) an existing <iopath>/distributedCache/ directory is not an error; in that case only the missing distributed cache files for the specific trace file are generated under <iopath>/distributedCache/. If all the needed distributed cache files are already there, submission of the GenerateDistCacheData job is skipped.
Without the -generate option, if emulation of distributed cache load is enabled, gridmix checks whether all the needed distributed cache files are available under <iopath>/distributedCache/ and emits an error if any expected file is missing.
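
The handling of existing and missing distributed cache files in (2) can be sketched as follows. This is a simplified illustration, not the patch code: the class and method names are hypothetical, file paths are plain strings, and the set of files already present is passed in instead of being looked up on HDFS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of gridmix.
final class DistCachePlan {
    // Returns the needed files that are not present yet, in trace order.
    static List<String> missingFiles(List<String> needed, Set<String> existing) {
        List<String> missing = new ArrayList<>();
        for (String path : needed) {
            if (!existing.contains(path)) {
                missing.add(path);
            }
        }
        return missing;
    }

    // With -generate: returns the files GenerateDistCacheData has to create;
    // an empty list means the job submission can be skipped.
    // Without -generate: throws if any expected file is missing.
    static List<String> filesToGenerate(boolean generate, List<String> needed,
                                        Set<String> existing) {
        List<String> missing = missingFiles(needed, existing);
        if (!generate && !missing.isEmpty()) {
            throw new IllegalStateException(
                "Missing distributed cache files: " + missing);
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> needed = Arrays.asList("a.jar", "b.txt", "c.dat");
        Set<String> existing = new HashSet<>(Arrays.asList("b.txt"));
        System.out.println(filesToGenerate(true, needed, existing)); // [a.jar, c.dat]
    }
}
```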
(3) setupDistCacheEmulation : Read the trace file and build a list of distributed cache file paths and their file sizes. The
file paths are the mapped paths on the simulated cluster (mapped from the original cluster's paths to the simulated cluster's
for public distributed cache files
for private distributed cache files.
This list of mapped file paths along with the file sizes is written to a special file
<iopath>/distributedCache/_distCacheFiles.txt and the file name can be configured using
This means that all distributed cache files in the gridmix simulated jobs are public distributed cache files, but each private distributed cache file of a user of the original cluster (i.e. from the trace file) maps to a different public distributed cache file on the gridmix simulated cluster.
(4) GenerateDistCacheData : The MR job (launched by gridmix when the -generate option is given) that generates the distributed cache data files on HDFS. The input to this job is the special file _distCacheFiles.txt, which contains the distributed cache file paths and their sizes.
Each map() call generates one distributed cache file.
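
As a rough picture of what one such map() call has to do, the sketch below writes a single file of a requested size. It is not the mapper from the patch: the class and method names are hypothetical, it uses java.nio on the local file system instead of the Hadoop FileSystem API, and the file content (pseudo-random bytes) is an assumption, since only the file size matters for the emulated load described here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

// Hypothetical helper, not part of gridmix.
final class DistCacheFileWriter {
    // Writes exactly sizeBytes bytes to target, creating parent directories
    // if needed and overwriting any existing file.
    static void writeFile(Path target, long sizeBytes) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        new Random(0).nextBytes(buffer); // assumed filler content
        if (target.getParent() != null) {
            Files.createDirectories(target.getParent());
        }
        try (OutputStream out = Files.newOutputStream(target)) {
            long remaining = sizeBytes;
            while (remaining > 0) {
                // Write a full buffer, or a partial one for the last chunk.
                int chunk = (int) Math.min(buffer.length, remaining);
                out.write(buffer, 0, chunk);
                remaining -= chunk;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("distCacheSketch");
        Path file = dir.resolve("user1").resolve("cachefile.dat");
        writeFile(file, 100_000L);
        System.out.println(Files.size(file)); // 100000
    }
}
```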
(5) configureDistCacheFiles : The mapped distributed cache file paths are set in the simulated jobs' configurations so that the MapReduce framework takes care of adding an actual distributed cache load equivalent to the original cluster's distributed cache load.