Hadoop Map/Reduce
  MAPREDUCE-2407

Make Gridmix emulate usage of Distributed Cache files

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: contrib/gridmix
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Makes Gridmix emulate HDFS-based distributed cache files and local-file-system-based distributed cache files.

      Description

      Currently Gridmix emulates disk IO load only. This JIRA is to make Gridmix emulate Distributed Cache load as defined by the job-trace.

      1. 2407.v1.1.patch
        100 kB
        Ravi Gummadi
      2. 2407.v1.patch
        100 kB
        Ravi Gummadi
      3. 2407.patch
        99 kB
        Ravi Gummadi

          Activity

          Ravi Gummadi added a comment -

          Attaching patch that adds emulation of distributed cache load in gridmix simulated jobs.

          High level details of what this patch does are:

          (1) New gridmix configuration property "gridmix.distributed-cache-emulation.enable" is added, whose default value is true. Setting it to false disables emulation of distributed cache load. Irrespective of this config property setting, with -generate option, distributed cache files are generated on HDFS by gridmix.
          Distributed Cache Emulation is disabled for the case of '-' as input trace (i.e. stdin stream instead of a file).
          Distributed Cache Emulation is disabled for the case where <iopath> is on local file system.
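
          The three conditions in (1) can be read as one boolean check. The sketch below is not taken from the patch: the class name, the use of a plain Map in place of a Hadoop Configuration, and the rule that only a "file" URI scheme counts as a local file system are all assumptions made for illustration. Only the property name and its default of true come from the comment above.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Gridmix: combines the conditions from item (1).
final class DistCacheEmulationSwitch {
    // Property name and default value as stated in the comment above.
    static final String ENABLE_KEY = "gridmix.distributed-cache-emulation.enable";

    /** Returns true only when none of the disabling conditions applies. */
    static boolean shouldEmulate(Map<String, String> conf, String tracePath, URI ioPath) {
        boolean enabled = Boolean.parseBoolean(conf.getOrDefault(ENABLE_KEY, "true"));
        // '-' means the trace is read from stdin, which disables emulation.
        boolean traceFromStdin = "-".equals(tracePath);
        // Assumption: only an explicit "file" scheme is treated as the local file system.
        boolean ioPathIsLocal = "file".equalsIgnoreCase(ioPath.getScheme());
        return enabled && !traceFromStdin && !ioPathIsLocal;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(shouldEmulate(conf, "trace.json", URI.create("hdfs://nn:8020/gridmix")));
        System.out.println(shouldEmulate(conf, "-", URI.create("hdfs://nn:8020/gridmix")));
    }
}
```

          Note that this check only governs emulation in the simulated jobs; per item (1), -generate still produces the distributed cache files regardless of the property.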

          (2) Behavior of the option -generate is changed. -generate option now means (a) generate input data in the directory
          <iopath>/input/ and (b) generate distributed cache data needed for emulation of distributed cache load of this
          trace file in the directory <iopath>/distributedCache/.
          For (a), the existing GenerateData MR job is used.
          For (b), a new MR job GenerateDistCacheData is added, which is run after GenerateData and before submission of simulated jobs.

          With -generate option, (a) existence of <iopath>/input/ directory gives an error, similar to current behavior and
          (b) existence of <iopath>/distributedCache/ directory is not an error and leads to generation of only the missing distributed cache files under <iopath>/distributedCache/ for the specific trace file. If all the needed distributed cache files are already
          there, then submission of GenerateDistCacheData job is skipped.

          Without -generate option, if emulation of distributed cache load is enabled, then gridmix checks if all the needed distributed cache files are available under <iopath>/distributedCache/ and emits an error if any of the expected files are missing.
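
          The rules in (2) for the distributed cache directory reduce to a small decision table. The following is a minimal sketch of that table, assuming the lists of needed and existing files are already known; the class, enum, and method names are hypothetical and do not come from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Gridmix: models the decision described in item (2).
final class DistCacheGeneratePlan {
    enum Action { GENERATE_MISSING, NOTHING_TO_GENERATE, ERROR_MISSING_FILES }

    /** Files required by the trace that are not yet present in the cache directory. */
    static List<String> missingFiles(List<String> needed, Set<String> existing) {
        List<String> missing = new ArrayList<>();
        for (String file : needed) {
            if (!existing.contains(file)) {
                missing.add(file);
            }
        }
        return missing;
    }

    static Action decide(boolean generateOption, boolean emulationEnabled,
                         List<String> needed, Set<String> existing) {
        boolean anyMissing = !missingFiles(needed, existing).isEmpty();
        if (generateOption) {
            // With -generate: create only what is missing, or skip the job entirely.
            return anyMissing ? Action.GENERATE_MISSING : Action.NOTHING_TO_GENERATE;
        }
        // Without -generate: missing files are an error only if emulation is enabled.
        if (emulationEnabled && anyMissing) {
            return Action.ERROR_MISSING_FILES;
        }
        return Action.NOTHING_TO_GENERATE;
    }

    public static void main(String[] args) {
        List<String> needed = Arrays.asList("a", "b");
        Set<String> existing = new HashSet<>(Arrays.asList("a"));
        System.out.println(decide(true, true, needed, existing));
        System.out.println(decide(false, true, needed, existing));
    }
}
```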

          (3) setupDistCacheEmulation : Reads the trace file and builds a list of distributed cache file paths and their file sizes. The
          file paths are the mapped paths on the simulated cluster. Paths are mapped from the original cluster's paths to the simulated
          cluster's paths using

          MD5Hash(filePath+timestamp)

          for public distributed cache files
          and

          MD5Hash(filePath+timestamp+username)

          for private distributed cache files.

          This list of mapped file paths along with the file sizes is written to a special file
          <iopath>/distributedCache/_distCacheFiles.txt and the file name can be configured using
          "gridmix.distcache.file.list".

          So this means all distributed cache files in the gridmix simulated jobs are public distributed cache files but for each private distributed cache file of a user of the original cluster (i.e. from trace file), there will be a different public distributed cache file on gridmix simulated cluster.
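
          The path mapping in (3) can be illustrated with the JDK's own MD5 implementation. This is a sketch only: it uses java.security.MessageDigest as a stand-in for Hadoop's MD5Hash, assumes plain string concatenation of the inputs and UTF-8 encoding, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Gridmix: derives a mapped file name as in item (3).
final class DistCachePathMapper {
    /** Hex-encoded MD5 digest of the UTF-8 bytes of the input string. */
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Public files hash path+timestamp; private files also include the user name. */
    static String mappedName(String filePath, long timestamp, String user, boolean isPublic) {
        String key = isPublic ? filePath + timestamp : filePath + timestamp + user;
        return md5Hex(key);
    }

    public static void main(String[] args) {
        System.out.println(mappedName("/apps/lib.jar", 1300000000000L, "alice", true));
        System.out.println(mappedName("/apps/lib.jar", 1300000000000L, "alice", false));
    }
}
```

          Because the user name is part of the hash input only for private files, two users referencing the same private path get two distinct mapped files, while a public file maps to a single name for everyone.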

          (4) GenerateDistCacheData : The MR job (launched by gridmix if -generate option is seen) that generates distributed cache data files on HDFS. Input to this job is the special file _distCacheFiles.txt that contains the distributed cache file paths and their sizes.
          Each map() call generates one distributed cache file.
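
          The "one record in, one file out" behavior of each map() call can be sketched outside MapReduce. The example below writes to the local file system through java.nio rather than to HDFS, and fills the file with pseudo-random bytes; both choices, along with the class and method names, are assumptions for illustration and are not taken from GenerateDistCacheData.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

// Hypothetical helper, not part of Gridmix: writes one file of a requested size.
final class DistCacheFileWriter {
    /** Writes exactly sizeBytes of pseudo-random data to target, creating parent directories. */
    static void generate(Path target, long sizeBytes) throws IOException {
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        byte[] buffer = new byte[8192];
        Random random = new Random();
        try (OutputStream out = Files.newOutputStream(target)) {
            long remaining = sizeBytes;
            while (remaining > 0) {
                random.nextBytes(buffer);
                // The last chunk may be shorter than the buffer.
                int chunk = (int) Math.min(buffer.length, remaining);
                out.write(buffer, 0, chunk);
                remaining -= chunk;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("distCacheDemo");
        Path file = dir.resolve("distributedCache").resolve("demoFile");
        generate(file, 20000L);
        System.out.println(Files.size(file));
    }
}
```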

          (5) configureDistCacheFiles : The mapped distributed cache files' paths are set in the simulated jobs' configurations so that the MapReduce framework takes care of adding the actual distributed cache load, equivalent to the original cluster's distributed cache load.
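
          Step (5) amounts to writing the mapped paths into each simulated job's configuration. The sketch below models the configuration as a plain Map and sets only the cache-file list; the key "mapreduce.job.cache.files" is my understanding of the standard MapReduce property for distributed cache files and is not quoted from the patch, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Gridmix: records mapped cache paths in a job configuration.
final class DistCacheJobConfig {
    // Assumption: standard MapReduce property naming the job's distributed cache files.
    static final String CACHE_FILES_KEY = "mapreduce.job.cache.files";

    /** Stores the mapped paths as a comma-separated list; leaves the conf untouched if there are none. */
    static void configure(Map<String, String> jobConf, List<String> mappedPaths) {
        if (mappedPaths.isEmpty()) {
            return;
        }
        jobConf.put(CACHE_FILES_KEY, String.join(",", mappedPaths));
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new LinkedHashMap<>();
        configure(jobConf, Arrays.asList("/gridmix/distributedCache/f1", "/gridmix/distributedCache/f2"));
        System.out.println(jobConf);
    }
}
```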

          Santosh Kumar added a comment -

          I will take it up from here. Please grant me the commit access.

          Ravi Gummadi added a comment -

          Amar, would you please review the patch? Thanks.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12478943/2407.patch
          against trunk revision 1125223.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 10 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          -1 release audit. The applied patch generated 3 release audit warnings (more than the trunk's current 2 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/283//testReport/
          Release audit warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/283//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/283//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/283//console

          This message is automatically generated.

          Ravi Gummadi added a comment -

          Attaching new patch fixing the release audit warning.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479884/2407.v1.patch
          against trunk revision 1125223.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 10 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/284//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/284//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/284//console

          This message is automatically generated.

          Amar Kamat added a comment -

          The latest patch looks good to me. I have some minor comments (mostly alignment, refactoring and parameter naming) which I have discussed with Ravi offline. I don't want to block the patch just for some minor comments. +1.

          Ravi Gummadi added a comment -

          Attaching new patch addressing Amar's minor offline comments.

          Amar Kamat added a comment -

          Patch looks good to me. +1

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12480093/2407.v1.1.patch
          against trunk revision 1125599.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 10 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/291//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/291//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/291//console

          This message is automatically generated.

          Ravi Gummadi added a comment -

          I just committed this to trunk.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #695 (See https://builds.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/695/)
          MAPREDUCE-2407. Make GridMix emulate usage of distributed cache files in simulated jobs.

          ravigummadi : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1126499
          Files :

          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/PseudoLocalFs.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/DistributedCacheEmulator.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/test/org/apache/hadoop/mapred/gridmix/TestPseudoLocalFs.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/Gridmix.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/GenerateData.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/JobCreator.java
          • /hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/gridmix.xml
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/GenerateDistCacheData.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/test/org/apache/hadoop/mapred/gridmix/DebugJobProducer.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/test/org/apache/hadoop/mapred/gridmix/TestDistCacheEmulation.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/test/org/apache/hadoop/mapred/gridmix/TestGridmixSubmission.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #689 (See https://builds.apache.org/hudson/job/Hadoop-Mapreduce-trunk/689/)
          MAPREDUCE-2407. Make GridMix emulate usage of distributed cache files in simulated jobs.

          ravigummadi : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1126499
          Files :

          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/PseudoLocalFs.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/DistributedCacheEmulator.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/test/org/apache/hadoop/mapred/gridmix/TestPseudoLocalFs.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/Gridmix.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/GenerateData.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/JobCreator.java
          • /hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/gridmix.xml
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/GenerateDistCacheData.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/test/org/apache/hadoop/mapred/gridmix/DebugJobProducer.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/test/org/apache/hadoop/mapred/gridmix/TestDistCacheEmulation.java
          • /hadoop/mapreduce/trunk/src/contrib/gridmix/src/test/org/apache/hadoop/mapred/gridmix/TestGridmixSubmission.java

            People

            • Assignee:
              Ravi Gummadi
              Reporter:
              Ravi Gummadi
            • Votes:
              0
              Watchers:
              4
